Yes, James, that looks perfect.
Great job!
Now we need the other tales in the same format.

Francisco

On 27.08.2013, at 15:14, James Tauber wrote:

> Let me know if this is what you had in mind (just the ugly duckling):
> 
> https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
> 
> I put each paragraph on its own line and separated the sections (which 
> were formerly separated by a row of asterisks) with a blank line. 
> 
> James
> 
> 
> On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber <[email protected]> 
> wrote:
> James,
> that's great!
> I think some more preparation is necessary:
> - All CRLFs should be removed, keeping one space after each full stop. (This 
> makes it easier for most parsers.)
> - Each line of asterisks should be replaced by a CRLF to mark the paragraphs. 
> (We never know, but we might need the paragraph info at some point.)
> - The file should be split into individual tales. (Whatever experiments we 
> run, if we rerun them with different tales, the results become more comparable.)
> - The title should not be written in caps. (A capital letter followed by a full 
> stop is interpreted as an acronym or middle initial instead of a sentence delimiter.)
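> 
> Roughly, the cleanup could look like this minimal Python sketch (the 
> function and argument names are just illustrative):
> 
>     import re
> 
>     def clean_tale(raw, title):
>         # Re-case the all-caps title so e.g. "DUCKLING." is not read as an acronym.
>         raw = raw.replace(title.upper(), title.title())
>         # Split on the rows of asterisks that used to separate the sections.
>         sections = re.split(r'\n\*[\* ]*\n', raw)
>         # Remove the CRLFs inside each section, keeping one space after each full stop.
>         sections = [' '.join(s.split()) for s in sections if s.strip()]
>         # One tale per file; within it, each section sits on its own line.
>         return '\n'.join(sections)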
> 
> Francisco
> 
> 
> On 27.08.2013, at 00:22, James Tauber <[email protected]> wrote:
> 
>> I've removed the metadata, the vocab lists and the illustrations:
>> 
>> https://gist.github.com/jtauber/6347309
>> 
>> James
>> 
>> 
>> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
>> I am sold on the kids' story idea.  I looked at the link below and there is 
>> a lot of metadata in this file.  It would have to be removed before feeding 
>> it to the CLA.
>> 
>>  
>> 
>> My assumption is that we would need a CLA with more columns than the 
>> standard 2048.  How many bits are in your word fingerprints?  Could we make 
>> each bit a column and skip the SP (spatial pooler)?
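>> 
>> Roughly what I am imagining, as a small sketch (FP_BITS is only a guess at 
>> the fingerprint width; the real number is my question above):
>> 
>>     import numpy as np
>> 
>>     FP_BITS = 16384  # assumed width of a CEPT word fingerprint
>> 
>>     def fingerprint_to_columns(active_bits):
>>         """Map each set bit of a word fingerprint directly onto a CLA
>>         column activation, bypassing the spatial pooler entirely."""
>>         columns = np.zeros(FP_BITS, dtype=bool)
>>         columns[list(active_bits)] = True
>>         return columns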
>> 
>> Jeff
>> 
>>  
>> 
>> From: nupic [mailto:[email protected]] On Behalf Of Francisco 
>> Webber
>> Sent: Monday, August 26, 2013 3:50 AM
>> 
>> 
>> To: NuPIC general mailing list.
>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>> 
>>  
>> 
>> Ian,
>> 
>> I also thought about something from the Gutenberg repository.
>> 
>> But I think we should start with something from the Kids' Shelf.
>> 
>>  
>> 
>> There are several reasons in my opinion:
>> 
>>  
>> 
>> - We start experimentation with a full bag of unknown parameters, so keeping 
>> the test material simple would allow us to detect the important ones sooner. 
>> And it is quite some work to create a reliable evaluation framework, so the 
>> size of the data set makes a difference.
>> 
>> - Keeping the text simple and short substantially reduces the overall 
>> vocabulary. If we want people to also evaluate offline, matching 
>> fingerprints can become a lengthy process without an efficient similarity 
>> engine.
>> 
>> - Another reason is that we don't know how much information a given set of 
>> columns (like the 2048 typically used) can absorb. In other words: what is 
>> the optimal ratio between the first layer of a text-HTM and the amount of 
>> text?
>> 
>> - Lastly, I believe the sequence in which text is presented to the CLA 
>> matters. After all, when humans learn information by reading, they also 
>> progress from simple to complex language. The rate of new vocabulary 
>> during training should be kept relatively stable (the actual rate is 
>> probably linked to the ratio in my previous point).
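>> 
>> As a rough illustration, the rate I mean could be measured with a few lines 
>> of plain Python (the names here are made up for the sketch):
>> 
>>     def new_vocab_rates(words, chunk_size=1000):
>>         """Fraction of previously unseen word types in each chunk of text."""
>>         seen, rates = set(), []
>>         for i in range(0, len(words), chunk_size):
>>             chunk = words[i:i + chunk_size]
>>             unseen = {w for w in chunk if w not in seen}
>>             rates.append(len(unseen) / len(chunk))
>>             seen.update(chunk)
>>         return rates
>> 
>> A well-ordered training set would keep these numbers roughly flat instead 
>> of spiking at the start of every new text.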
>> 
>>  
>> 
>> So we should build progressively more complex training data sets, finally 
>> ending up with "true" books like the ones you listed.
>> 
>>  
>> 
>> To start I would suggest something like:
>> 
>>  
>> 
>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
>> 
>> http://www.gutenberg.org/ebooks/7841
>> 
>>  
>> 
>> But there might still be better ones…
>> 
>>  
>> 
>> Francisco
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>> 
>> 
>> 
>> 
>> I will make three suggestions. All are out of copyright, well known, 
>> uncontroversial, and still taught in schools (at least in the US).
>> 
>>  
>> 
>> 1. Robinson Crusoe - Daniel Defoe
>> 
>>  
>> 
>> http://www.gutenberg.org/ebooks/521
>> 
>>  
>> 
>> 2. Great Expectations - Charles Dickens
>> 
>>  
>> 
>> http://www.gutenberg.org/ebooks/1400
>> 
>>  
>> 
>> 3. The Time Machine - H.G. Wells
>> 
>>  
>> 
>> http://www.gutenberg.org/ebooks/35
>> 
>>  
>> 
>> Ian
>> 
>>  
>> 
>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
>> 
>> For those who don't want to use the API and for evaluation purposes, I would 
>> propose that we choose some reference text and I convert it into a sequence 
>> of SDRs. This file could be used for training.
>> 
>> I would also generate a list of all the words contained in the text, together 
>> with their SDRs, to be used as a conversion table.
>> 
>> As a simple test measure we could feed a sequence of SDRs into a trained 
>> network and see if the HTM makes the right prediction about the following 
>> word(s). 
>> 
>> The last file to produce for a complete framework would be a list of, let's 
>> say, 100 word sequences with their correct continuations.
>> 
>> The word sequences could be, for example, the beginnings of phrases with more 
>> than n words (n being the number of steps that the CLA can predict ahead).
>> 
>> This could be the beginning of a measuring set-up that allows us to compare 
>> different CLA implementation flavors.
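>> 
>> To make the set-up concrete, here is a minimal sketch of the scoring loop 
>> (the model interface is purely hypothetical, just to fix ideas):
>> 
>>     def score(model, test_items, sdr_of, nearest_word):
>>         """test_items: (prefix_words, correct_next_word) pairs, e.g. 100 of
>>         them.  sdr_of turns a word into its SDR; nearest_word maps a
>>         predicted SDR back to the closest word via the conversion table."""
>>         hits = 0
>>         for prefix, target in test_items:
>>             model.reset()                             # hypothetical: clear sequence state
>>             for word in prefix:
>>                 predicted = model.feed(sdr_of(word))  # hypothetical one-step prediction
>>             hits += nearest_word(predicted) == target
>>         return hits / len(test_items)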
>> 
>>  
>> 
>> Any suggestions for a text to choose?
>> 
>>  
>> 
>> Francisco
>> 
>>  
>> 
>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>> 
>>  
>> 
>> Very cool, Francisco. Here is where you can get CEPT API credentials: 
>> https://cept.3scale.net/signup
>> 
>> 
>> 
>> ---------
>> 
>> Matt Taylor
>> 
>> OS Community Flag-Bearer
>> 
>> Numenta
>> 
>>  
>> 
>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
>> 
>> Just a short post scriptum:
>> 
>> The public version of our API doesn't actually contain the generic 
>> conversion function. But if people from the HTM community want to experiment, 
>> just click the "Request for Beta-Program" button and I will upgrade your 
>> accounts manually.
>> 
>> Francisco
>> 
>> 
>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>> 
>> > Jeff,
>> > I thought about this already.
>> > We have a REST API where you can send a word in and get the SDR back, and 
>> > vice versa.
>> > I invite all who want to experiment to try it out.
>> > You just need to get credentials at our website: www.cept.at.
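>> >
>> > Usage is roughly like this (the endpoint path below is made up for the
>> > sketch; the real paths are documented with the API):
>> >
>> >     import requests
>> >
>> >     API_KEY = "your-key-from-cept.at"
>> >     # Hypothetical endpoint: word in, SDR (list of active bit positions) out.
>> >     resp = requests.get("https://api.cept.at/v1/word2sdr",
>> >                         params={"word": "apple", "api_key": API_KEY})
>> >     sdr = resp.json()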
>> >
>> > In the mid-term, it would be cool to create some sort of evaluation set that 
>> > could be used to measure progress while improving the CLA.
>> >
>> > We are continuously improving our Retina but the version that is currently 
>> > online works pretty well already.
>> >
>> > I hope that helps.
>> >
>> > Francisco
>> >
>> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>> >
>> >> Francisco,
>> >> Your work is very cool.  Do you think it would be possible to make your
>> >> word SDRs (or a sufficient subset of them) available for experimentation?
>> >> I imagine there would be interest in the NuPIC community in training a CLA
>> >> on text using your word SDRs.  You might get some useful results more
>> >> quickly.  You could do this under a research-only license or something
>> >> like that.
>> >> Jeff
>> >>
>> >> -----Original Message-----
>> >> From: nupic [mailto:[email protected]] On Behalf Of 
>> >> Francisco
>> >> Webber
>> >> Sent: Wednesday, August 21, 2013 1:01 PM
>> >> To: NuPIC general mailing list.
>> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>> >>
>> >> Hello,
>> >> I am one of the founders of CEPT Systems and the lead researcher on our
>> >> Retina algorithm.
>> >>
>> >> We have developed a method to represent words by a bitmap pattern capturing
>> >> most of their "lexical semantics" (a text sensor). Our word-SDRs fulfill all
>> >> the requirements for "good" HTM input data:
>> >>
>> >> - Words with similar meaning "look" similar
>> >> - If you drop random bits from the representation, the semantics remain intact
>> >> - Only a small number (up to 5%) of the bits are set in a word-SDR
>> >> - Every bit in the representation corresponds to a specific semantic feature
>> >> of the language used
>> >> - The Retina (a sensory organ for an HTM) can be trained on any language
>> >> - The Retina training process is fully unsupervised
>> >>
>> >> We have found that the word-SDRs by themselves (without using any HTM yet)
>> >> can improve many NLP tasks that are only poorly solved by the traditional
>> >> statistical approaches.
>> >> We use the SDRs to:
>> >> - Create fingerprints of text documents, which allows us to compare them for
>> >> semantic similarity using simple (Euclidean) similarity measures; a toy
>> >> sketch of this follows below
>> >> - Automatically detect polysemy and disambiguate multiple meanings
>> >> - Characterize any text with context terms for automatic search-engine
>> >> query expansion
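>> >>
>> >> As a toy illustration of the fingerprint comparison (this is not our
>> >> production algorithm, just the general shape of it):
>> >>
>> >>     import numpy as np
>> >>
>> >>     def doc_fingerprint(word_sdrs, width=16384, sparsity=0.02):
>> >>         """Aggregate word-SDRs (sets of active bit positions) into a
>> >>         document fingerprint by keeping the most frequently set bits.
>> >>         Width and sparsity are illustrative, not our real parameters."""
>> >>         counts = np.zeros(width)
>> >>         for bits in word_sdrs:
>> >>             counts[list(bits)] += 1
>> >>         fingerprint = np.zeros(width)
>> >>         fingerprint[np.argsort(counts)[-int(width * sparsity):]] = 1
>> >>         return fingerprint
>> >>
>> >>     def distance(fp_a, fp_b):
>> >>         return np.linalg.norm(fp_a - fp_b)  # the simple Euclidean measure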
>> >>
>> >> We hope to successfully link up our Retina to an HTM network to go beyond
>> >> lexical semantics into the field of "grammatical semantics".
>> >> This would hopefully lead to improved abstracting, conversation,
>> >> question-answering and translation systems.
>> >>
>> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>> >>
>> >> I am interested in any form of cooperation to apply HTM technology to 
>> >> text.
>> >>
>> >> Francisco
>> >>
>> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>> >>
>> >>>
>> >>> Hello.
>> >>>
>> >>> Like many of you here, I am pretty new to HTM technology.
>> >>>
>> >>> I am a researcher in Brazil and I am going to start my PhD program soon.
>> >>> My field of interest is NLP and the extraction of knowledge from text. I am
>> >>> thinking of using the ideas behind the Memory Prediction Framework to
>> >>> investigate semantic information retrieval from the Web and answering
>> >>> questions in natural language. I intend to use the HTM implementation as a
>> >>> base for this.
>> >>>
>> >>> I would appreciate it a lot if someone could answer some questions:
>> >>>
>> >>> - Is there any research related to HTM and NLP? Could you point me to it?
>> >>>
>> >>> - Is HTM suitable for this problem? Could it learn, without supervision,
>> >>> the grammar of a language, or just help with some aspects such as Named
>> >>> Entity Recognition?
>> >>>
>> >>>
>> >>>
>> >>> Regards,
>> >>>
>> >>> Christian
>>
>> -- 
>> James Tauber
>> http://jtauber.com/
>> @jtauber on Twitter
> 
> -- 
> James Tauber
> http://jtauber.com/
> @jtauber on Twitter

_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
