I plan to work on it tonight and will commit the Python scripts I write for them to my repo.
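
As a first cut, something along these lines for the statistics Francisco lists below (a rough, untested sketch; the texts/ layout and the constant names are placeholders):

# corpus_stats.py -- rough sketch, untested.
import glob
import os
import re

# Keep punctuation marks as separate "words", per the note below.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

paths = sorted(glob.glob("texts/*.txt"))  # placeholder: one tale per file
doc_tokens = {}
for path in paths:
    with open(path, encoding="utf-8") as f:
        doc_tokens[path] = tokenize(f.read())

collection_tokens = [t for ts in doc_tokens.values() for t in ts]
collection_vocab = set(collection_tokens)

# Wordlist of the collection (each occurring word has an entry).
with open("collection_wordlist.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(collection_vocab)) + "\n")

# Give each calculated item a speaking name...
stats = {
    "COLLECTION_SIZE_BYTES": sum(os.path.getsize(p) for p in paths),
    "COLLECTION_WORD_COUNT": len(collection_tokens),
    "COLLECTION_VOCAB_SIZE": len(collection_vocab),
}
for path in paths:
    name = os.path.splitext(os.path.basename(path))[0].upper()
    vocab = set(doc_tokens[path])
    stats[name + "_SIZE_BYTES"] = os.path.getsize(path)
    stats[name + "_WORD_COUNT"] = len(doc_tokens[path])
    stats[name + "_VOCAB_SIZE"] = len(vocab)
    # Document vocabulary as a percentage of the collection vocabulary.
    stats[name + "_VOCAB_COVERAGE_PCT"] = round(
        100.0 * len(vocab) / len(collection_vocab), 2)

# ...and generate an include file of constants for the evaluation code.
with open("corpus_constants.py", "w", encoding="utf-8") as out:
    for key in sorted(stats):
        out.write("{} = {!r}\n".format(key, stats[key]))
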
James

On Wed, Aug 28, 2013 at 2:16 PM, Matthew Taylor <[email protected]> wrote:

> If anyone starts to work on tasks in Francisco's list of statistical
> characteristics, please reply here so we don't duplicate any work.
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
> On Wed, Aug 28, 2013 at 10:14 AM, Francisco Webber <[email protected]> wrote:
>
>> James, that's great!
>>
>> A next step would be to calculate some statistical characteristics of the
>> collection. Typically:
>>
>> - Size in bytes of the collection
>> - Size in bytes of each document
>> - Word count of the collection (punctuation marks should count as words too)
>> - Word count of each document (ditto)
>> - Wordlist of the collection (each occurring word has an entry)
>> - Wordlist of each document (ditto)
>> - Coverage: each document's vocabulary as a percentage of the collection
>>   vocabulary (maybe also the vocabulary unique to each document)
>>
>> The last item will tell us whether coverage is evenly distributed over the
>> different documents. We might eliminate some documents from the list if
>> they don't match.
>>
>> In the end we could write a script that gives each calculated item a
>> speaking name, casts it as a constant, and generates an include file. That
>> makes it easy to create the evaluation code later.
>>
>> Francisco
>>
>> On 28.08.2013, at 18:47, James Tauber wrote:
>>
>> I've actually moved the texts to a full-blown GitHub repo:
>>
>> https://github.com/jtauber/nupic-texts
>>
>> so feel free to log issues against it if other changes are necessary,
>> and/or fork and send pull requests if you want to change or add anything.
>>
>> James
>>
>> On Tue, Aug 27, 2013 at 1:54 PM, James Tauber <[email protected]> wrote:
>>
>>> All done:
>>>
>>> https://gist.github.com/jtauber/6347309
>>>
>>> On Tue, Aug 27, 2013 at 12:35 PM, James Tauber <[email protected]> wrote:
>>>
>>>> Yep, I'm working on it :-)
>>>>
>>>> On Tue, Aug 27, 2013 at 12:29 PM, Francisco Webber <[email protected]> wrote:
>>>>
>>>>> Yes, James, that looks perfect.
>>>>> Great job!
>>>>> Now we need the other tales in the same format.
>>>>>
>>>>> Francisco
>>>>>
>>>>> On 27.08.2013, at 15:14, James Tauber wrote:
>>>>>
>>>>> Let me know if this is what you had in mind (just The Ugly Duckling):
>>>>>
>>>>> https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
>>>>>
>>>>> I put each paragraph on its own line and separated the sections (which
>>>>> were formerly separated by a row of asterisks) with a blank line.
>>>>>
>>>>> James
>>>>>
>>>>> On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber <[email protected]> wrote:
>>>>>
>>>>>> James,
>>>>>> that's great!
>>>>>> I think some more preparation is necessary:
>>>>>>
>>>>>> - All CRLFs should be removed, keeping one space after each full stop.
>>>>>>   (This makes it easier for most parsers.)
>>>>>> - Each line of asterisks should be replaced by a CRLF to mark the
>>>>>>   paragraphs. (We never know, but we could need the paragraph info at
>>>>>>   some point.)
>>>>>> - The file as such should be split into single tales. (Whatever
>>>>>>   experiments we run, rerunning them with different tales makes the
>>>>>>   results more comparable.)
>>>>>> - The titles should not be written in caps. (A capital letter followed
>>>>>>   by a full stop is interpreted as an acronym or a middle initial
>>>>>>   instead of a sentence delimiter.)
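>>>>>>
>>>>>> Roughly, these steps could look like this (an untested sketch; the
>>>>>> input filename is a placeholder, and it assumes one tale per file with
>>>>>> the title as the first paragraph):
>>>>>>
>>>>>> # prepare_tale.py -- sketch of the cleanup steps above (untested).
>>>>>> import re
>>>>>>
>>>>>> with open("the_ugly_duckling_raw.txt", encoding="utf-8") as f:
>>>>>>     raw = f.read()
>>>>>>
>>>>>> # A row of asterisks marked a section break in the Gutenberg source;
>>>>>> # treat it as a paragraph boundary.
>>>>>> sections = re.split(r"\n\s*\*[\s\*]*\n", raw)
>>>>>>
>>>>>> paragraphs = []
>>>>>> for section in sections:
>>>>>>     # Unwrap hard line breaks, leaving one space between sentences.
>>>>>>     text = re.sub(r"\s*\n\s*", " ", section).strip()
>>>>>>     if text:
>>>>>>         paragraphs.append(text)
>>>>>>
>>>>>> # Assume the first paragraph is the all-caps title; re-case it so
>>>>>> # "CAPITAL LETTER + full stop" isn't read as an acronym.
>>>>>> title = paragraphs[0].title()
>>>>>>
>>>>>> out_name = title.lower().replace(" ", "_") + ".txt"
>>>>>> with open(out_name, "w", encoding="utf-8") as out:
>>>>>>     out.write(title + "\n\n")
>>>>>>     out.write("\n".join(paragraphs[1:]) + "\n")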
>>>>>>
>>>>>> Francisco
>>>>>>
>>>>>> On 27.08.2013, at 00:22, James Tauber <[email protected]> wrote:
>>>>>>
>>>>>> I've removed the metadata, the vocab lists and the illustrations:
>>>>>>
>>>>>> https://gist.github.com/jtauber/6347309
>>>>>>
>>>>>> James
>>>>>>
>>>>>> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
>>>>>>
>>>>>>> I am sold on the kid's story idea. I looked at the link below and
>>>>>>> there is a lot of metadata in this file. It would have to be removed
>>>>>>> before feeding it to the CLA.
>>>>>>>
>>>>>>> My assumption is that we would need a CLA with more columns than the
>>>>>>> standard 2048. How many bits are in your word fingerprints? Could we
>>>>>>> make each bit a column and skip the SP?
>>>>>>>
>>>>>>> Jeff
>>>>>>>
>>>>>>> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>>>>>> Sent: Monday, August 26, 2013 3:50 AM
>>>>>>> To: NuPIC general mailing list.
>>>>>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>>>>
>>>>>>> Ian,
>>>>>>>
>>>>>>> I also thought about something from the Gutenberg repository, but I
>>>>>>> think we should start with something from the kids' shelf.
>>>>>>>
>>>>>>> There are several reasons, in my opinion:
>>>>>>>
>>>>>>> - We start experimentation with a full bag of unknown parameters, so
>>>>>>> keeping the test material simple would let us spot the important ones
>>>>>>> sooner. And it is quite some work to create a reliable evaluation
>>>>>>> framework, so the size of the data set makes a difference.
>>>>>>>
>>>>>>> - Keeping the text simple and short substantially reduces the overall
>>>>>>> vocabulary. If we want people to evaluate offline as well, matching
>>>>>>> fingerprints can become a lengthy process without an efficient
>>>>>>> similarity engine.
>>>>>>>
>>>>>>> - Another reason is that we don't know how much information a given
>>>>>>> set of columns (like the 2048 typically used) can absorb; in other
>>>>>>> words, what the optimal ratio is between the first layer of a text HTM
>>>>>>> and the amount of text.
>>>>>>>
>>>>>>> - Lastly, I believe the sequence in which text is presented to the CLA
>>>>>>> matters. After all, when humans learn information by reading, they
>>>>>>> also progress from simple to complex language.
>>>>>>> The amount of new vocabulary during training should be relatively
>>>>>>> stable (the actual amount would probably be linked to the ratio in my
>>>>>>> previous point).
>>>>>>>
>>>>>>> So we should build progressively more complex training data sets,
>>>>>>> finally ending up with "true" books like the ones you listed.
>>>>>>>
>>>>>>> To start, I would suggest something like:
>>>>>>>
>>>>>>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
>>>>>>> http://www.gutenberg.org/ebooks/7841
>>>>>>>
>>>>>>> But there might still be better ones…
>>>>>>>
>>>>>>> Francisco
>>>>>>>
>>>>>>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>>>>>>>
>>>>>>> I will make three suggestions. All are out of copyright, well known,
>>>>>>> uncontroversial, and still taught in schools (at least in the US).
>>>>>>>
>>>>>>> 1. Robinson Crusoe - Daniel Defoe
>>>>>>> http://www.gutenberg.org/ebooks/521
>>>>>>>
>>>>>>> 2. Great Expectations - Charles Dickens
>>>>>>> http://www.gutenberg.org/ebooks/1400
>>>>>>>
>>>>>>> 3. The Time Machine - H.G. Wells
>>>>>>> http://www.gutenberg.org/ebooks/35
>>>>>>>
>>>>>>> Ian
>>>>>>>
>>>>>>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
>>>>>>>
>>>>>>> For those who don't want to use the API, and for evaluation purposes,
>>>>>>> I would propose that we choose some reference text and I convert it
>>>>>>> into a sequence of SDRs. This file could be used for training.
>>>>>>>
>>>>>>> I would also generate a list of all words contained in the text,
>>>>>>> together with their SDRs, to be used as a conversion table.
>>>>>>>
>>>>>>> As a simple test measure, we could feed a sequence of SDRs into a
>>>>>>> trained network and see if the HTM makes the right prediction about
>>>>>>> the following word(s).
>>>>>>>
>>>>>>> The last file to produce for a complete framework would be a list of,
>>>>>>> let's say, 100 word sequences with their correct continuations. The
>>>>>>> word sequences could be, for example, the beginnings of phrases with
>>>>>>> more than n words (n being the number of steps that the CLA can
>>>>>>> predict ahead).
>>>>>>>
>>>>>>> This could be the beginning of a measuring set-up that allows us to
>>>>>>> compare different CLA implementation flavors.
>>>>>>>
>>>>>>> Any suggestions for a text to choose?
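>>>>>>>
>>>>>>> For illustration, the test harness could be as simple as this
>>>>>>> (an untested sketch; model.predict_next stands in for whatever
>>>>>>> interface a trained CLA ends up exposing, and the test-file
>>>>>>> format is invented):
>>>>>>>
>>>>>>> # evaluate.py -- sketch of the measuring set-up (untested).
>>>>>>> def evaluate(model, word_to_sdr, path="test_sequences.tsv"):
>>>>>>>     """Fraction of test sequences whose continuation is predicted.
>>>>>>>
>>>>>>>     Each line of the file: "phrase beginning<TAB>next word".
>>>>>>>     """
>>>>>>>     correct = total = 0
>>>>>>>     with open(path, encoding="utf-8") as f:
>>>>>>>         for line in f:
>>>>>>>             sequence, continuation = line.rstrip("\n").split("\t")
>>>>>>>             sdrs = [word_to_sdr[w] for w in sequence.split()]
>>>>>>>             predicted = model.predict_next(sdrs)  # assumed interface
>>>>>>>             correct += int(predicted == continuation)
>>>>>>>             total += 1
>>>>>>>     return correct / float(total)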
>>>>>>>
>>>>>>> Francisco
>>>>>>>
>>>>>>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>>>>>>>
>>>>>>> Very cool, Francisco. Here is where you can get cept API credentials:
>>>>>>> https://cept.3scale.net/signup
>>>>>>>
>>>>>>> ---------
>>>>>>> Matt Taylor
>>>>>>> OS Community Flag-Bearer
>>>>>>> Numenta
>>>>>>>
>>>>>>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
>>>>>>>
>>>>>>> Just a short post scriptum:
>>>>>>> The public version of our API doesn't actually contain the generic
>>>>>>> conversion function, but if people from the HTM community want to
>>>>>>> experiment, just click the "Request for Beta-Program" button and I
>>>>>>> will upgrade your accounts manually.
>>>>>>>
>>>>>>> Francisco
>>>>>>>
>>>>>>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>>>>>>>
>>>>>>> Jeff,
>>>>>>> I thought about this already. We have a REST API where you can send a
>>>>>>> word in and get the SDR back, and vice versa. I invite everyone who
>>>>>>> wants to experiment to try it out. You just need to get credentials at
>>>>>>> our website: www.cept.at.
>>>>>>>
>>>>>>> In the mid-term it would be cool to create some sort of evaluation set
>>>>>>> that could be used to measure progress while improving the CLA.
>>>>>>>
>>>>>>> We are continuously improving our Retina, but the version that is
>>>>>>> currently online already works pretty well.
>>>>>>>
>>>>>>> I hope that helps.
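>>>>>>>
>>>>>>> Usage would be along these lines (a hypothetical sketch: the
>>>>>>> endpoint path, parameter names, and response field here are
>>>>>>> placeholders, not the real interface; see the API docs once you
>>>>>>> have credentials):
>>>>>>>
>>>>>>> # word_sdr_client.py -- hypothetical client sketch (untested).
>>>>>>> import requests
>>>>>>>
>>>>>>> API_KEY = "your-key-here"  # placeholder
>>>>>>>
>>>>>>> def word_to_sdr(word):
>>>>>>>     resp = requests.get("https://api.cept.at/word2sdr",  # hypothetical
>>>>>>>                         params={"word": word, "api_key": API_KEY})
>>>>>>>     resp.raise_for_status()
>>>>>>>     return resp.json()["sdr"]  # hypothetical response field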
>>>>>>>
>>>>>>> Francisco
>>>>>>>
>>>>>>> On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>>>>>>>
>>>>>>> Francisco,
>>>>>>> Your work is very cool. Do you think it would be possible to make your
>>>>>>> word SDRs (or a sufficient subset of them) available for
>>>>>>> experimentation? I imagine there would be interest in the NuPIC
>>>>>>> community in training a CLA on text using your word SDRs. You might
>>>>>>> get some useful results more quickly. You could do this under a
>>>>>>> research-only license or something like that.
>>>>>>> Jeff
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>>>>>> Sent: Wednesday, August 21, 2013 1:01 PM
>>>>>>> To: NuPIC general mailing list.
>>>>>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>>>>
>>>>>>> Hello,
>>>>>>> I am one of the founders of CEPT Systems and the lead researcher on
>>>>>>> our Retina algorithm.
>>>>>>>
>>>>>>> We have developed a method to represent words by a bitmap pattern
>>>>>>> capturing most of their "lexical semantics" (a text sensor). Our
>>>>>>> word-SDRs fulfill all the requirements for "good" HTM input data:
>>>>>>>
>>>>>>> - Words with similar meanings "look" similar.
>>>>>>> - If you drop random bits from the representation, the semantics
>>>>>>> remain intact.
>>>>>>> - Only a small number (up to 5%) of bits are set in a word-SDR.
>>>>>>> - Every bit in the representation corresponds to a specific semantic
>>>>>>> feature of the language used.
>>>>>>> - The Retina (a sensory organ for an HTM) can be trained on any
>>>>>>> language.
>>>>>>> - The Retina training process is fully unsupervised.
>>>>>>>
>>>>>>> We have found that the word-SDRs by themselves (without using any HTM
>>>>>>> yet) can improve on many NLP problems that are only poorly solved by
>>>>>>> the traditional statistical approaches. We use the SDRs to:
>>>>>>>
>>>>>>> - Create fingerprints of text documents, which lets us compare them
>>>>>>> for semantic similarity using simple (Euclidean) similarity measures.
>>>>>>> - Automatically detect polysemy and disambiguate multiple meanings.
>>>>>>> - Characterize any text with context terms for automatic search-engine
>>>>>>> query expansion.
>>>>>>>
>>>>>>> We hope to link our Retina up to an HTM network to go beyond lexical
>>>>>>> semantics into the field of "grammatical semantics". This would
>>>>>>> hopefully lead to improved abstracting, conversation,
>>>>>>> question-answering, and translation systems.
>>>>>>>
>>>>>>> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>>>>>>>
>>>>>>> I am interested in any form of cooperation to apply HTM technology to
>>>>>>> text.
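>>>>>>>
>>>>>>> As a toy example of the fingerprint comparison mentioned above (a
>>>>>>> sketch, not production code; note that for 0/1 vectors the squared
>>>>>>> Euclidean distance equals the Hamming distance):
>>>>>>>
>>>>>>> # similarity.py -- comparing two binary fingerprints (untested).
>>>>>>> import numpy as np
>>>>>>>
>>>>>>> def euclidean(a, b):
>>>>>>>     return np.sqrt(np.sum((a - b) ** 2))
>>>>>>>
>>>>>>> def overlap(a, b):
>>>>>>>     # Number of shared active bits; higher means more similar.
>>>>>>>     return int(np.sum(a & b))
>>>>>>>
>>>>>>> # Vector size is arbitrary here.
>>>>>>> a = np.zeros(16384, dtype=int); a[[5, 80, 4000]] = 1
>>>>>>> b = np.zeros(16384, dtype=int); b[[5, 80, 9999]] = 1
>>>>>>> print(euclidean(a, b), overlap(a, b))  # -> 1.414..., 2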
>>>>>>>
>>>>>>> Francisco
>>>>>>>
>>>>>>> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>>>>>>>
>>>>>>> Hello.
>>>>>>>
>>>>>>> Like many of you here, I am pretty new to HTM technology.
>>>>>>>
>>>>>>> I am a researcher in Brazil and I am going to start my PhD program
>>>>>>> soon. My field of interest is NLP and the extraction of knowledge from
>>>>>>> text. I am thinking of using the ideas behind the Memory Prediction
>>>>>>> Framework to investigate semantic information retrieval from the Web
>>>>>>> and answering questions in natural language. I intend to use the HTM
>>>>>>> implementation as a base for this.
>>>>>>>
>>>>>>> I would appreciate it a lot if someone could answer some questions:
>>>>>>>
>>>>>>> - Is there any research related to HTM and NLP? Could you point me to
>>>>>>> it?
>>>>>>>
>>>>>>> - Is HTM suited to this problem? Could it learn, without supervision,
>>>>>>> the grammar of a language, or just help with some aspects such as
>>>>>>> named entity recognition?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Christian

--
James Tauber
http://jtauber.com/
@jtauber on Twitter
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
