Ok, I will work on the specs further.

Francisco
On 28.08.2013, at 20:21, James Tauber wrote:

> I plan to work on it tonight and will commit the python scripts I write
> for them to my repo.
>
> James
>
> On Wed, Aug 28, 2013 at 2:16 PM, Matthew Taylor <[email protected]> wrote:
> If anyone starts to work on tasks in Francisco's list of statistical
> characteristics, please reply here so we don't have any duplication of work.
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
> On Wed, Aug 28, 2013 at 10:14 AM, Francisco Webber <[email protected]> wrote:
> James, that's great!
>
> A next step would be to calculate some statistical characteristics of the
> collection. Typically:
>
> - Size in bytes of the collection
> - Size in bytes of each document
> - Word count of the collection (punctuation marks should count as words too)
> - Word count of each document (idem)
> - Wordlist of the collection (one entry per occurring word)
> - Wordlist of each document (idem)
> - Coverage of each document's vocabulary, in percent of the collection
>   vocabulary (maybe also the vocabulary unique to each document)
>
> The last item will tell us whether the coverage is evenly distributed over
> the documents. We might eliminate some documents from the list if they
> don't match.
>
> In the end we could write a script that gives each calculated item a
> descriptive name, casts it as a constant and generates an include file.
> That makes it easy to create the evaluation code later.
>
> Francisco
>
> On 28.08.2013, at 18:47, James Tauber wrote:
>
>> I've actually moved the texts to a full-blown GitHub repo:
>>
>> https://github.com/jtauber/nupic-texts
>>
>> so feel free to log issues against it if other changes are necessary and/or
>> fork and do pull requests if you want to change/add anything.
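The statistics Francisco lists could be sketched roughly as follows (the tokenizer, file handling and constant-naming scheme are assumptions, not an agreed spec):

```python
import os
import re
from collections import Counter

# Punctuation marks count as words too, per Francisco's note.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def document_stats(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    tokens = TOKEN_RE.findall(text.lower())
    return {
        "bytes": os.path.getsize(path),
        "word_count": len(tokens),
        "wordlist": Counter(tokens),  # one entry per occurring word
    }

def collection_stats(paths):
    docs = {p: document_stats(p) for p in paths}
    collection_vocab = set().union(*(d["wordlist"] for d in docs.values()))
    for p, d in docs.items():
        vocab = set(d["wordlist"])
        other_vocab = set().union(
            *(set(o["wordlist"]) for q, o in docs.items() if q != p)
        )
        # Coverage of the collection vocabulary, in percent.
        d["coverage_pct"] = 100.0 * len(vocab) / len(collection_vocab)
        # Words that occur in no other document.
        d["unique_vocab"] = vocab - other_vocab
    return docs, collection_vocab

def write_constants(docs, collection_vocab, out_path):
    # Give each calculated item a descriptive name, cast it as a constant,
    # and generate an include file for the later evaluation code.
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f"COLLECTION_VOCAB_SIZE = {len(collection_vocab)}\n")
        for p, d in sorted(docs.items()):
            stem = os.path.splitext(os.path.basename(p))[0]
            name = re.sub(r"\W+", "_", stem).upper()
            out.write(f"{name}_WORD_COUNT = {d['word_count']}\n")
```

A document whose coverage is far below the others would then stand out directly in the generated constants.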
>>
>> James
>>
>> On Tue, Aug 27, 2013 at 1:54 PM, James Tauber <[email protected]> wrote:
>> All done:
>>
>> https://gist.github.com/jtauber/6347309
>>
>> On Tue, Aug 27, 2013 at 12:35 PM, James Tauber <[email protected]> wrote:
>> yep, I'm working on it :-)
>>
>> On Tue, Aug 27, 2013 at 12:29 PM, Francisco Webber <[email protected]> wrote:
>> yes James, that looks perfect.
>> great job!
>> Now we need the other tales in the same format.
>>
>> Francisco
>>
>> On 27.08.2013, at 15:14, James Tauber wrote:
>>
>>> Let me know if this is what you had in mind (just the ugly duckling):
>>>
>>> https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
>>>
>>> I put each paragraph on its own line and separated the sections (which
>>> were formerly separated by a row of asterisks) with a blank line.
>>>
>>> James
>>>
>>> On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber
>>> <[email protected]> wrote:
>>> James,
>>> that's great!
>>> I think some more preparations are necessary:
>>> - All CRLFs should be removed, keeping one blank space after each full
>>>   stop. (This makes it easier for most parsers.)
>>> - The line of asterisks should be replaced by a CRLF to mark the
>>>   paragraphs. (We never know, but we might need paragraph info at some
>>>   point.)
>>> - The file as such should be split into single tales. (Whatever
>>>   experiments we run, if we rerun them with different tales, results
>>>   become more comparable.)
>>> - The title should not be written in caps. (A capital letter followed by
>>>   a full stop is interpreted as an acronym or middle name instead of a
>>>   sentence delimiter.)
>>>
>>> Francisco
>>>
>>> On 27.08.2013, at 00:22, James Tauber <[email protected]> wrote:
>>>
>>>> I've removed the metadata, the vocab lists and the illustrations:
>>>>
>>>> https://gist.github.com/jtauber/6347309
>>>>
>>>> James
>>>>
>>>> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
>>>> I am sold on the kid's story idea.
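Francisco's preparation steps could be sketched like this (a rough sketch; the exact section handling and title fix are guesses at the intent):

```python
import re

def clean_tale(raw):
    # Sections were separated by a row of asterisks, paragraphs by hard
    # line breaks.  Put each paragraph on one line (removing the CRLFs,
    # keeping a single space after each full stop) and separate sections
    # with a blank line, matching the format of James's gist.
    sections = re.split(r"^\s*\*[\s*]*$", raw, flags=re.MULTILINE)
    cleaned = []
    for section in sections:
        paragraphs = [
            " ".join(p.split())
            for p in re.split(r"\n\s*\n", section)
            if p.strip()
        ]
        cleaned.append("\n".join(paragraphs))
    return "\n\n".join(s for s in cleaned if s)

def fix_title(title):
    # "THE UGLY DUCKLING" -> "The Ugly Duckling", so that the capitals
    # are not misread as acronyms by sentence splitters.
    return title.title() if title.isupper() else title
```

Splitting the collection into single tales would then just be a matter of writing each cleaned section, or each Gutenberg story, to its own file.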
>>>> I looked at the link below and there is a lot of metadata in this
>>>> file. It would have to be removed before feeding it to the CLA.
>>>>
>>>> My assumption is that we would need a CLA with more columns than the
>>>> standard 2048. How many bits are in your word fingerprints? Could we
>>>> make each bit a column and skip the SP?
>>>>
>>>> Jeff
>>>>
>>>> From: nupic [mailto:[email protected]] On Behalf Of
>>>> Francisco Webber
>>>> Sent: Monday, August 26, 2013 3:50 AM
>>>> To: NuPIC general mailing list
>>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>
>>>> Ian,
>>>> I also thought about something from the Gutenberg repository.
>>>> But I think we should start with something from the Kids Shelf.
>>>>
>>>> There are several reasons, in my opinion:
>>>>
>>>> - We start experimentation with a full bag of unknown parameters, so
>>>> keeping the test material simple would let us detect the important ones
>>>> sooner. And it is quite some work to create a reliable evaluation
>>>> framework, so the size of the data set makes a difference.
>>>>
>>>> - Keeping the text simple and short substantially reduces the overall
>>>> vocabulary. If we want people to also evaluate offline, matching
>>>> fingerprints can become a lengthy process without an efficient
>>>> similarity engine.
>>>>
>>>> - Another reason is that we don't know how much information a given set
>>>> of columns (like the 2048 typically used) can absorb. In other words:
>>>> what is the optimal ratio between a first layer of a text-HTM and the
>>>> amount of text?
>>>>
>>>> - Lastly, I believe that the sequence in which text is presented to the
>>>> CLA matters. After all, when humans learn information by reading, they
>>>> also progress from simple to complex language.
>>>> The amount of new vocabulary during training should be relatively
>>>> stable (the actual amount would probably be linked to the ratio in my
>>>> previous point).
>>>>
>>>> So we should build progressively more complex training data sets,
>>>> finally ending up with "true" books like the ones you listed.
>>>>
>>>> To start, I would suggest something like:
>>>>
>>>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by
>>>> Children
>>>> http://www.gutenberg.org/ebooks/7841
>>>>
>>>> But there might still be better ones…
>>>>
>>>> Francisco
>>>>
>>>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>>>>
>>>> I will make 3 suggestions. All are out of copyright, well known,
>>>> uncontroversial, and still taught in schools (at least in the US):
>>>>
>>>> 1. Robinson Crusoe - Daniel Defoe
>>>>    http://www.gutenberg.org/ebooks/521
>>>>
>>>> 2. Great Expectations - Charles Dickens
>>>>    http://www.gutenberg.org/ebooks/1400
>>>>
>>>> 3. The Time Machine - H.G. Wells
>>>>    http://www.gutenberg.org/ebooks/35
>>>>
>>>> Ian
>>>>
>>>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]>
>>>> wrote:
>>>>
>>>> For those who don't want to use the API, and for evaluation purposes, I
>>>> would propose that we choose some reference text and I convert it into
>>>> a sequence of SDRs. This file could be used for training.
>>>>
>>>> I would also generate a list of all words contained in the text,
>>>> together with their SDRs, to be used as a conversion table.
>>>>
>>>> As a simple test measure we could feed a sequence of SDRs into a
>>>> trained network and see if the HTM makes the right prediction about the
>>>> following word(s).
>>>> The last file to produce for a complete framework would be a list of,
>>>> let's say, 100 word sequences with their correct continuation.
>>>>
>>>> The word sequences could be, for example, the beginnings of phrases
>>>> with more than n words (n being the number of steps that the CLA can
>>>> predict ahead).
>>>>
>>>> This could be the beginning of a measuring set-up that allows us to
>>>> compare different CLA-implementation flavors.
>>>>
>>>> Any suggestions for a text to choose?
>>>>
>>>> Francisco
>>>>
>>>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>>>>
>>>> Very cool, Francisco. Here is where you can get cept API credentials:
>>>> https://cept.3scale.net/signup
>>>>
>>>> ---------
>>>> Matt Taylor
>>>> OS Community Flag-Bearer
>>>> Numenta
>>>>
>>>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
>>>>
>>>> Just a short post scriptum:
>>>> The public version of our API doesn't actually contain the generic
>>>> conversion function. But if people from the HTM community want to
>>>> experiment, just click the "Request for Beta-Program" button and I will
>>>> upgrade your accounts manually.
>>>>
>>>> Francisco
>>>>
>>>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>>>>
>>>> > Jeff,
>>>> > I thought about this already.
>>>> > We have a REST API where you can send a word in and get the SDR back,
>>>> > and vice versa.
>>>> > I invite all who want to experiment to try it out.
>>>> > You just need to get credentials at our website: www.cept.at.
>>>> >
>>>> > In the mid-term it would be cool to create some sort of evaluation
>>>> > set that could be used to measure progress while improving the CLA.
>>>> >
>>>> > We are continuously improving our Retina, but the version that is
>>>> > currently online works pretty well already.
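The measuring set-up Francisco outlines might look like this (the `reset`/`predict` interface is hypothetical, standing in for whatever a trained CLA actually exposes, and the conversion table maps words to their SDRs):

```python
# Sketch of the evaluation loop: feed each test sequence (the beginning
# of a phrase) into a trained model and check whether it predicts the
# correct continuation.  `model` is a hypothetical predictor interface,
# not an actual NuPIC call; here `predict` is assumed to return the
# predicted next word, already looked up back through the table.

def evaluate(model, test_cases, sdr_table):
    """test_cases: list of (word_sequence, correct_next_word) pairs."""
    hits = 0
    for sequence, expected in test_cases:
        model.reset()                    # start a fresh sequence
        for word in sequence:
            predicted = model.predict(sdr_table[word])
        # Score only the prediction made after the last word was fed in.
        if predicted == expected:
            hits += 1
    return hits / len(test_cases)
```

The resulting hit rate would be the single number used to compare different CLA-implementation flavors on the same 100-sequence test file.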
>>>> >
>>>> > I hope that helps
>>>> >
>>>> > Francisco
>>>> >
>>>> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>>>> >
>>>> >> Francisco,
>>>> >> Your work is very cool. Do you think it would be possible to make
>>>> >> available your word SDRs (or a sufficient subset of them) for
>>>> >> experimentation? I imagine there would be interest in the NuPIC
>>>> >> community in training a CLA on text using your word SDRs. You might
>>>> >> get some useful results more quickly. You could do this under a
>>>> >> research-only license or something like that.
>>>> >> Jeff
>>>> >>
>>>> >> -----Original Message-----
>>>> >> From: nupic [mailto:[email protected]] On Behalf Of
>>>> >> Francisco Webber
>>>> >> Sent: Wednesday, August 21, 2013 1:01 PM
>>>> >> To: NuPIC general mailing list
>>>> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>> >>
>>>> >> Hello,
>>>> >> I am one of the founders of CEPT Systems and lead researcher of our
>>>> >> retina algorithm.
>>>> >>
>>>> >> We have developed a method to represent words by a bitmap pattern
>>>> >> capturing most of their "lexical semantics" (a text sensor). Our
>>>> >> word-SDRs fulfill all the requirements for "good" HTM input data:
>>>> >>
>>>> >> - Words with similar meaning "look" similar
>>>> >> - If you drop random bits in the representation, the semantics
>>>> >> remain intact
>>>> >> - Only a small number (up to 5%) of bits are set in a word-SDR
>>>> >> - Every bit in the representation corresponds to a specific semantic
>>>> >> feature of the language used
>>>> >> - The Retina (sensory organ for an HTM) can be trained on any
>>>> >> language
>>>> >> - The Retina training process is fully unsupervised
>>>> >>
>>>> >> We have found that the word-SDR by itself (without using any HTM
>>>> >> yet) can improve many NLP problems that are only poorly solved using
>>>> >> traditional statistical approaches.
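The word-SDR properties listed above can be illustrated with a toy sketch (the sizes and the random "encoding" are invented for illustration; the real Retina assigns bits by semantic feature, not at random):

```python
import random

SDR_SIZE = 16384      # e.g. a 128 x 128 retina grid (assumed size)
ACTIVE_BITS = 328     # about 2% of bits set, within the "up to 5%" bound

def random_word_sdr(rng):
    # Stand-in for a Retina encoding: a random sparse set of bit indices.
    return frozenset(rng.sample(range(SDR_SIZE), ACTIVE_BITS))

def overlap(a, b):
    # Shared set bits: the basic similarity measure for binary SDRs.
    return len(a & b)

def drop_bits(sdr, fraction, rng):
    # Randomly drop a fraction of the set bits.  Because each remaining
    # bit still marks a semantic feature, the word stays recognizable.
    keep = rng.sample(sorted(sdr), round(len(sdr) * (1 - fraction)))
    return frozenset(keep)
```

Even after dropping 30% of the bits, a degraded SDR overlaps its original far more than two unrelated words overlap by chance (the expected chance overlap here is about 328² / 16384 ≈ 7 bits), which is what makes the representation robust.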
>>>> >> We use the SDRs to:
>>>> >> - Create fingerprints of text documents, which allows us to compare
>>>> >> them for semantic similarity using simple (Euclidean) similarity
>>>> >> measures
>>>> >> - Automatically detect polysemy and disambiguate multiple meanings
>>>> >> - Characterize any text with context terms for automatic
>>>> >> search-engine query expansion
>>>> >>
>>>> >> We hope to successfully link up our Retina to an HTM network to go
>>>> >> beyond lexical semantics into the field of "grammatical semantics".
>>>> >> This would hopefully lead to improved abstracting, conversation,
>>>> >> question-answering and translation systems.
>>>> >>
>>>> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>>>> >>
>>>> >> I am interested in any form of cooperation to apply HTM technology
>>>> >> to text.
>>>> >>
>>>> >> Francisco
>>>> >>
>>>> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>>>> >>
>>>> >>> Hello.
>>>> >>>
>>>> >>> Like many of you here, I am pretty new to HTM technology.
>>>> >>>
>>>> >>> I am a researcher in Brazil and I am going to start my PhD program
>>>> >>> soon. My field of interest is NLP and the extraction of knowledge
>>>> >>> from text. I am thinking of using the ideas behind the Memory
>>>> >>> Prediction Framework to investigate semantic information retrieval
>>>> >>> from the Web and answering questions in natural language. I intend
>>>> >>> to use the HTM implementation as a base for this.
>>>> >>>
>>>> >>> I would appreciate it a lot if someone could answer some questions:
>>>> >>>
>>>> >>> - Is there any research related to HTM and NLP? Could you point me
>>>> >>> to it?
>>>> >>>
>>>> >>> - Is HTM suited to this problem? Could it learn, without
>>>> >>> supervision, the grammar of a language, or just help with some
>>>> >>> aspects such as Named Entity Recognition?
>>>> >>>
>>>> >>> Regards,
>>>> >>>
>>>> >>> Christian
>>>> >>>
>>>> >>> _______________________________________________
>>>> >>> nupic mailing list
>>>> >>> [email protected]
>>>> >>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>
>>>> --
>>>> James Tauber
>>>> http://jtauber.com/
>>>> @jtauber on Twitter
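The document-fingerprint comparison from Francisco's CEPT mail above could be sketched like this (the aggregation rule, summing word SDRs and keeping the most frequent bits, is an assumption for illustration; CEPT's actual method is not described in the thread):

```python
from collections import Counter

def document_fingerprint(word_sdrs, active_bits=10):
    # Sum the word SDRs and keep only the most frequently active bits,
    # so the document fingerprint is itself a sparse set of bits.
    counts = Counter()
    for sdr in word_sdrs:
        counts.update(sdr)
    return frozenset(bit for bit, _ in counts.most_common(active_bits))

def similarity(fp_a, fp_b):
    # Euclidean distance between the binary vectors, folded into a
    # similarity score in (0, 1].  For binary bit sets the squared
    # distance is just the size of the symmetric difference.
    distance = len(fp_a ^ fp_b) ** 0.5
    return 1.0 / (1.0 + distance)
```

Two documents that share their dominant vocabulary end up with near-identical fingerprints, while documents with disjoint vocabulary score much lower, which is the basis of the semantic-similarity comparison described in the mail.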
