If anyone starts working on tasks from Francisco's list of statistical
characteristics, please reply here so we don't duplicate work.

---------
Matt Taylor
OS Community Flag-Bearer
Numenta


On Wed, Aug 28, 2013 at 10:14 AM, Francisco Webber <[email protected]> wrote:

> James, that's great!
>
> A next step would be to calculate some statistical characteristics of the
> collection.
> Typically:
>
> - Size in Bytes of the Collection
> - Size in Bytes of each Document
> - Word count of the Collection (punctuation marks should count as words too)
> - Word count of each Document (likewise)
> - Word list of the Collection (each occurring word has an entry)
> - Word list of each Document (likewise)
> - Coverage of each Document's vocabulary as a percentage of the Collection
> vocabulary (maybe also the vocabulary unique to each Document)
>
> The last item will tell us whether coverage is evenly distributed across
> the documents. We might eliminate some documents from the list if they
> don't fit.
>
> In the end we could make a script that gives each calculated item a
> descriptive ("speaking") name, casts it as a constant, and generates an
> include file. This makes it easy to create the evaluation code later.
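>
> A minimal sketch of such a script (the tales/ layout, the token pattern,
> and the constant names are my own placeholders, not an agreed convention):
>
> # corpus_stats.py -- sketch of the proposed statistics script.
> # Assumes one file per tale under tales/; all names are illustrative.
> import glob
> import os
> import re
>
> TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # punctuation marks count as words
>
> docs = {}
> for path in sorted(glob.glob("tales/*.txt")):
>     words = TOKEN_RE.findall(open(path).read())
>     docs[path] = (os.path.getsize(path), words)
>
> collection_vocab = set()
> for size, words in docs.values():
>     collection_vocab.update(words)
>
> lines = [
>     "COLLECTION_SIZE_BYTES = %d" % sum(s for s, _ in docs.values()),
>     "COLLECTION_WORD_COUNT = %d" % sum(len(w) for _, w in docs.values()),
>     "COLLECTION_VOCAB_SIZE = %d" % len(collection_vocab),
> ]
> for path, (size, words) in sorted(docs.items()):
>     name = os.path.splitext(os.path.basename(path))[0].upper()
>     coverage = 100.0 * len(set(words)) / len(collection_vocab)
>     lines.append("%s_SIZE_BYTES = %d" % (name, size))
>     lines.append("%s_WORD_COUNT = %d" % (name, len(words)))
>     lines.append("%s_VOCAB_SIZE = %d" % (name, len(set(words))))
>     lines.append("%s_COVERAGE_PCT = %.1f" % (name, coverage))
>
> # The generated include file holds every calculated item as a constant.
> with open("corpus_constants.py", "w") as out:
>     out.write("\n".join(lines) + "\n")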
>
> Francisco
>
>
> On 28.08.2013, at 18:47, James Tauber wrote:
>
> I've actually moved the texts to a full-blown GitHub repo:
>
> https://github.com/jtauber/nupic-texts
>
> so feel free to log issues against it if other changes are necessary
> and/or fork and do pull requests if you want to change/add anything.
>
> James
>
>
> On Tue, Aug 27, 2013 at 1:54 PM, James Tauber <[email protected]> wrote:
>
>> All done:
>>
>> https://gist.github.com/jtauber/6347309
>>
>>
>>
>>
>> On Tue, Aug 27, 2013 at 12:35 PM, James Tauber <[email protected]> wrote:
>>
>>> yep, I'm working on it :-)
>>>
>>>
>>> On Tue, Aug 27, 2013 at 12:29 PM, Francisco Webber <[email protected]> wrote:
>>>
>>>> Yes, James, that looks perfect.
>>>> Great job!
>>>> Now we need the other tales in the same format.
>>>>
>>>> Francisco
>>>>
>>>> On 27.08.2013, at 15:14, James Tauber wrote:
>>>>
>>>> Let me know if this is what you had in mind (just the ugly duckling):
>>>>
>>>> https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
>>>>
>>>> I put each paragraph on its own line and separated the sections (that
>>>> formerly were separated by a row of asterisks) with a blank line.
>>>>
>>>> James
>>>>
>>>>
>>>> On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber <
>>>> [email protected]> wrote:
>>>>
>>>>> James,
>>>>> that's great!
>>>>> I think some more preparation is necessary:
>>>>> - All CRLFs should be removed, keeping one space after each full stop.
>>>>> (This makes it easier for most parsers.)
>>>>> - The lines of asterisks should be replaced by a CRLF to mark the
>>>>> paragraphs. (We never know, but we could need paragraph info at some
>>>>> point.)
>>>>> - The file as such should be split into single tales. (Whatever
>>>>> experiments we run, if we rerun them with different tales, the results
>>>>> become more comparable.)
>>>>> - The titles should not be written in caps. (A capital letter followed
>>>>> by a full stop is interpreted as an acronym or middle name instead of a
>>>>> sentence delimiter.)
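>>>>>
>>>>> A rough sketch of these preparation steps (the input file name, the
>>>>> asterisk-row pattern, and the all-caps title handling are my guesses at
>>>>> the file's actual layout, not a tested recipe):
>>>>>
>>>>> # prepare_text.py -- sketch of the clean-up steps above.
>>>>> import re
>>>>>
>>>>> raw = open("fairy_tales.txt").read().replace("\r\n", "\n")
>>>>>
>>>>> # Replace each row of asterisks with a paragraph break.
>>>>> raw = re.sub(r"\n[ \t]*\*[\* \t]*\n", "\n\n", raw)
>>>>>
>>>>> # Join the wrapped lines of each paragraph into one line; collapsing
>>>>> # whitespace leaves exactly one space after each full stop.
>>>>> paragraphs = [" ".join(p.split()) for p in raw.split("\n\n") if p.strip()]
>>>>>
>>>>> # De-caps titles so "WORD." is not read as an acronym or middle name.
>>>>> paragraphs = [p.title() if p.isupper() else p for p in paragraphs]
>>>>>
>>>>> # Splitting into one file per tale would key off the title lines here.
>>>>> open("prepared.txt", "w").write("\n".join(paragraphs) + "\n")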
>>>>>
>>>>> Francisco
>>>>>
>>>>>
>>>>> On 27.08.2013 at 00:22, James Tauber <[email protected]> wrote:
>>>>>
>>>>> I've removed the metadata, the vocab lists and the illustrations:
>>>>>
>>>>> https://gist.github.com/jtauber/6347309
>>>>>
>>>>> James
>>>>>
>>>>>
>>>>> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
>>>>>
>>>>>> I am sold on the kids' story idea. I looked at the link below and
>>>>>> there is a lot of metadata in this file. It would have to be removed
>>>>>> before feeding it to the CLA.
>>>>>>
>>>>>> My assumption is that we would need a CLA with more columns than the
>>>>>> standard 2048. How many bits are in your word fingerprints? Could we
>>>>>> make each bit a column and skip the SP?
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>>>>> Sent: Monday, August 26, 2013 3:50 AM
>>>>>> To: NuPIC general mailing list
>>>>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>>>
>>>>>> Ian,
>>>>>>
>>>>>> I also thought about something from the Gutenberg repository, but I
>>>>>> think we should start with something from the Kids Shelf.
>>>>>>
>>>>>> There are several reasons, in my opinion:
>>>>>>
>>>>>> - We start experimentation with a full bag of unknown parameters, so
>>>>>> keeping the test material simple would allow us to detect the important
>>>>>> ones sooner. And it is quite some work to create a reliable evaluation
>>>>>> framework, so the size of the data set makes a difference.
>>>>>>
>>>>>> - Keeping the text simple and short substantially reduces the overall
>>>>>> vocabulary. If we want people to also evaluate offline, matching
>>>>>> fingerprints can become a lengthy process without an efficient
>>>>>> similarity engine.
>>>>>>
>>>>>> - Another reason is that we don't know how much information a given set
>>>>>> of columns (like the 2048 typically used) can absorb. In other words:
>>>>>> what is the optimal ratio between the first layer of a text-HTM and the
>>>>>> amount of text?
>>>>>>
>>>>>> - Lastly, I believe the sequence in which text is presented to the CLA
>>>>>> matters. After all, when humans learn by reading, they also progress
>>>>>> from simple to complex language. The rate of new vocabulary during
>>>>>> training should be relatively stable (the actual rate would probably be
>>>>>> linked to the ratio in my previous point).
>>>>>>
>>>>>> So we should build progressively more complex training data sets,
>>>>>> finally ending up with "true" books like the ones you listed.
>>>>>>
>>>>>> To start I would suggest something like:
>>>>>>
>>>>>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by
>>>>>> Children
>>>>>> http://www.gutenberg.org/ebooks/7841
>>>>>>
>>>>>> But there might still be better ones…
>>>>>>
>>>>>> Francisco
>>>>>>
>>>>>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>>>>>>
>>>>>> I will make three suggestions. All are out of copyright, well known,
>>>>>> uncontroversial, and still taught in schools (at least in the US):
>>>>>>
>>>>>> 1. Robinson Crusoe - Daniel Defoe
>>>>>>    http://www.gutenberg.org/ebooks/521
>>>>>>
>>>>>> 2. Great Expectations - Charles Dickens
>>>>>>    http://www.gutenberg.org/ebooks/1400
>>>>>>
>>>>>> 3. The Time Machine - H.G. Wells
>>>>>>    http://www.gutenberg.org/ebooks/35
>>>>>>
>>>>>> Ian
>>>>>>
>>>>>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> For those who don't want to use the API, and for evaluation purposes, I
>>>>>> would propose that we choose some reference text and I convert it into
>>>>>> a sequence of SDRs. This file could be used for training.
>>>>>>
>>>>>> I would also generate a list of all words contained in the text,
>>>>>> together with their SDRs, to be used as a conversion table.
>>>>>>
>>>>>> As a simple test measure, we could feed a sequence of SDRs into a
>>>>>> trained network and see if the HTM makes the right prediction about the
>>>>>> following word(s).
>>>>>>
>>>>>> The last file to produce for a complete framework would be a list of,
>>>>>> let's say, 100 word sequences with their correct continuations.
>>>>>>
>>>>>> The word sequences could be, for example, the beginnings of phrases
>>>>>> with more than n words (n being the number of steps that the CLA can
>>>>>> predict ahead).
>>>>>>
>>>>>> This could be the beginning of a measuring set-up that allows us to
>>>>>> compare different CLA-implementation flavors.
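>>>>>>
>>>>>> A minimal sketch of that evaluation loop (the toy conversion table and
>>>>>> the predict_next() stub below are invented stand-ins for the conversion
>>>>>> table and trained network described above, not an agreed interface):
>>>>>>
>>>>>> # Sketch of the proposed test measure; all data here is invented.
>>>>>> word_to_sdr = {
>>>>>>     "once": frozenset([1, 40, 803]),
>>>>>>     "upon": frozenset([7, 40, 992]),
>>>>>>     "a":    frozenset([3, 514]),
>>>>>>     "time": frozenset([9, 61, 700]),
>>>>>> }
>>>>>> sdr_to_word = dict((sdr, w) for w, sdr in word_to_sdr.items())
>>>>>>
>>>>>> def predict_next(sdr_sequence):
>>>>>>     # Stand-in for the trained HTM; returns the SDR it predicts next.
>>>>>>     return word_to_sdr["time"]
>>>>>>
>>>>>> test_set = [(["once", "upon", "a"], "time")]  # toy (prefix, next) pairs
>>>>>> correct = 0
>>>>>> for prefix, expected in test_set:
>>>>>>     sdrs = [word_to_sdr[w] for w in prefix]
>>>>>>     if sdr_to_word.get(predict_next(sdrs)) == expected:
>>>>>>         correct += 1
>>>>>> print("accuracy: %.1f%%" % (100.0 * correct / len(test_set)))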
>>>>>>
>>>>>> Any suggestions for a text to choose?
>>>>>>
>>>>>> Francisco
>>>>>>
>>>>>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>>>>>>
>>>>>> Very cool, Francisco. Here is where you can get cept API credentials:
>>>>>> https://cept.3scale.net/signup
>>>>>>
>>>>>> ---------
>>>>>> Matt Taylor
>>>>>> OS Community Flag-Bearer
>>>>>> Numenta
>>>>>>
>>>>>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Just a short post scriptum:
>>>>>>
>>>>>> The public version of our API doesn't actually contain the generic
>>>>>> conversion function, but if people from the HTM community want to
>>>>>> experiment, just click the "Request for Beta-Program" button and I will
>>>>>> upgrade your accounts manually.
>>>>>>
>>>>>> Francisco
>>>>>>
>>>>>>
>>>>>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>>>>>>
>>>>>> > Jeff,
>>>>>> > I thought about this already.
>>>>>> > We have a REST API where you can send a word in and get the SDR
>>>>>> back, and vice versa.
>>>>>> > I invite all who want to experiment to try it out.
>>>>>> > You just need to get credentials at our website: www.cept.at.
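>>>>>> >
>>>>>> > A hedged sketch of what calling such an API might look like; the
>>>>>> > endpoint path, parameter names, and response shape below are guesses,
>>>>>> > not CEPT's documented interface:
>>>>>> >
>>>>>> > # Hypothetical client sketch -- the URL, parameters, and response
>>>>>> > # field are assumptions; consult the actual cept.at API docs.
>>>>>> > import requests
>>>>>> >
>>>>>> > API_KEY = "your-key-here"   # credentials obtained via cept.at
>>>>>> > resp = requests.get("http://api.cept.at/term2sdr",          # guessed
>>>>>> >                     params={"term": "apple", "api_key": API_KEY})
>>>>>> > bits = resp.json()["positions"]                             # guessed
>>>>>> > print("%d bits set" % len(bits))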
>>>>>> >
>>>>>> > In the mid-term it would be cool to create some sort of evaluation
>>>>>> > set that could be used to measure progress while improving the CLA.
>>>>>> >
>>>>>> > We are continuously improving our Retina but the version that is
>>>>>> currently online works pretty well already.
>>>>>> >
>>>>>> > I hope that helps.
>>>>>> >
>>>>>> > Francisco
>>>>>> >
>>>>>> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>>>>>> >
>>>>>> >> Francisco,
>>>>>> >> Your work is very cool.  Do you think it would be possible to make
>>>>>> available
>>>>>> >> your word SDRs (or a sufficient subset of them) for
>>>>>> experimentation?  I
>>>>>> >> imagine there would be interest in the NuPIC community in
>>>>>> training a CLA
>>>>>> >> on text using your word SDRs.  You might get some useful results
>>>>>> more
>>>>>> >> quickly.  You could do this under a research only license or
>>>>>> something like
>>>>>> >> that.
>>>>>> >> Jeff
>>>>>> >>
>>>>>> >> -----Original Message-----
>>>>>> >> From: nupic [mailto:[email protected]] On Behalf Of
>>>>>> Francisco
>>>>>> >> Webber
>>>>>> >> Sent: Wednesday, August 21, 2013 1:01 PM
>>>>>> >> To: NuPIC general mailing list.
>>>>>> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>>> >>
>>>>>> >> Hello,
>>>>>> >> I am one of the founders of CEPT Systems and the lead researcher on
>>>>>> >> our Retina algorithm.
>>>>>> >>
>>>>>> >> We have developed a method to represent words by a bitmap pattern
>>>>>> >> capturing most of their "lexical semantics" (a text sensor). Our
>>>>>> >> word-SDRs fulfill all the requirements for "good" HTM input data:
>>>>>> >>
>>>>>> >> - Words with similar meaning "look" similar
>>>>>> >> - If you drop random bits from the representation, the semantics
>>>>>> >> remain intact
>>>>>> >> - Only a small number (up to 5%) of bits are set in a word-SDR
>>>>>> >> - Every bit in the representation corresponds to a specific semantic
>>>>>> >> feature of the language used
>>>>>> >> - The Retina (the sensory organ for an HTM) can be trained on any
>>>>>> >> language
>>>>>> >> - The Retina training process is fully unsupervised
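>>>>>> >>
>>>>>> >> As a toy illustration of the sparsity and noise-robustness points
>>>>>> >> (all sizes and values below are invented, not real Retina output):
>>>>>> >>
>>>>>> >> # Toy check of two word-SDR properties; all numbers are invented.
>>>>>> >> import random
>>>>>> >>
>>>>>> >> total_bits = 16384                      # assumed Retina size
>>>>>> >> word_sdr = set(random.sample(range(total_bits), 300))  # ~1.8% set
>>>>>> >> assert len(word_sdr) / float(total_bits) <= 0.05  # sparsity bound
>>>>>> >>
>>>>>> >> # Dropping random bits leaves a subset, so overlap with the
>>>>>> >> # original stays high and the meaning remains recognizable.
>>>>>> >> noisy = set(random.sample(sorted(word_sdr), 200))
>>>>>> >> print("overlap: %.0f%%" % (100.0 * len(word_sdr & noisy) / 300))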
>>>>>> >>
>>>>>> >> We have found that the word-SDRs by themselves (without using any
>>>>>> >> HTM yet) can improve many NLP problems that are only poorly solved
>>>>>> >> by the traditional statistical approaches.
>>>>>> >> We use the SDRs to:
>>>>>> >> - Create fingerprints of text documents, which allows us to compare
>>>>>> >> them for semantic similarity using simple (Euclidean) similarity
>>>>>> >> measures
>>>>>> >> - Automatically detect polysemy and disambiguate multiple meanings
>>>>>> >> - Characterize any text with context terms for automatic
>>>>>> >> search-engine query expansion
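>>>>>> >>
>>>>>> >> A back-of-the-envelope illustration of that kind of comparison (the
>>>>>> >> fingerprints below are invented toy data, not real Retina output):
>>>>>> >>
>>>>>> >> # Comparing two document fingerprints held as sets of set-bit
>>>>>> >> # positions; the bit positions here are invented toy data.
>>>>>> >> import math
>>>>>> >>
>>>>>> >> fp_a = set([3, 17, 42, 99, 250])    # bits set in document A
>>>>>> >> fp_b = set([3, 17, 42, 101, 250])   # bits set in document B
>>>>>> >>
>>>>>> >> # For binary vectors, the squared Euclidean distance equals the
>>>>>> >> # size of the symmetric difference of the two bit sets.
>>>>>> >> distance = math.sqrt(len(fp_a ^ fp_b))
>>>>>> >> overlap = len(fp_a & fp_b)
>>>>>> >> print("overlap=%d distance=%.2f" % (overlap, distance))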
>>>>>> >>
>>>>>> >> We hope to successfully link up our Retina to an HTM network to go
>>>>>> >> beyond lexical semantics into the field of "grammatical semantics".
>>>>>> >> This would hopefully lead to improved abstracting, conversation,
>>>>>> >> question-answering, and translation systems.
>>>>>> >>
>>>>>> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>>>>>> >>
>>>>>> >> I am interested in any form of cooperation to apply HTM technology
>>>>>> to text.
>>>>>> >>
>>>>>> >> Francisco
>>>>>> >>
>>>>>> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>>>>>> >>
>>>>>> >>>
>>>>>> >>> Hello.
>>>>>> >>>
>>>>>> >>> Like many of you here, I am pretty new to HTM technology.
>>>>>> >>>
>>>>>> >>> I am a researcher in Brazil and I am going to start my PhD program
>>>>>> >>> soon. My field of interest is NLP and the extraction of knowledge
>>>>>> >>> from text. I am thinking of using the ideas behind the Memory
>>>>>> >>> Prediction Framework to investigate semantic information retrieval
>>>>>> >>> from the Web and to answer questions in natural language. I intend
>>>>>> >>> to use the HTM implementation as a base for this.
>>>>>> >>>
>>>>>> >>> I would appreciate it a lot if someone could answer some questions:
>>>>>> >>>
>>>>>> >>> - Is there any research related to HTM and NLP? Could you point me
>>>>>> >>> to it?
>>>>>> >>>
>>>>>> >>> - Is HTM suited to this problem? Could it learn, without
>>>>>> >>> supervision, the grammar of a language, or just help with some
>>>>>> >>> aspects such as Named Entity Recognition?
>>>>>> >>>
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Regards,
>>>>>> >>>
>>>>>> >>> Christian
>>>>>> >>>
>>>>>
>>>>>
>>>>> --
>>>>> James Tauber
>>>>> http://jtauber.com/
>>>>> @jtauber on Twitter
>>>>
>>>>
>>>> --
>>>> James Tauber
>>>> http://jtauber.com/
>>>> @jtauber on Twitter
>>>
>>>
>>> --
>>> James Tauber
>>> http://jtauber.com/
>>> @jtauber on Twitter
>>>
>>
>>
>>
>> --
>> James Tauber
>> http://jtauber.com/
>> @jtauber on Twitter
>>
>
>
>
> --
> James Tauber
> http://jtauber.com/
> @jtauber on Twitter
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
