Re: [nupic-dev] HTM in Natural Language Processing

Francisco Webber Mon, 26 Aug 2013 11:39:37 -0700

Jeff,
I am still not completely convinced that skipping the SP is a good thing to do.
It is true that when you feed scalars into the system the SP acts like an 
SDRizer, but in the case of the text-retina we already get SDRs in the first 
place. I believe that in this case, the SP learns another aspect of the data, 
namely the semantic topology of the input pattern. This leads me to a scheme 
where each column gets a field of, let's say, 9 input bits arranged as a 3x3 grid.
Depending on the amount of memory one can spend, these 3x3 fields could be fed in 
a non-overlapping mode. This would mean that the 128x128 sensor bits need an 
array of 43x43 columns = 1849.
If we decided to overlap the 3x3 fields by one bit, the 128x128 sensor 
array would be mapped to 64x64 = 4096 columns.
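
The column counts above can be checked with a short Python sketch. It is only
an illustration of the arithmetic, not part of NuPIC or the CEPT API: the
function names are hypothetical, and it assumes that fields hanging over the
edge of the sensor are padded or truncated, so the count per side is rounded up.

```python
import math


def columns_per_side(sensor_side: int, field_side: int, overlap: int) -> int:
    """Number of receptive fields needed along one side of the sensor.

    Assumption: neighbouring fields are shifted by (field_side - overlap)
    bits, and a field that runs past the sensor edge still counts as one
    column (the missing bits would be padded or ignored).
    """
    stride = field_side - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the field size")
    return math.ceil(sensor_side / stride)


def column_count(sensor_side: int, field_side: int, overlap: int) -> int:
    """Total number of columns for a square sensor and square fields."""
    per_side = columns_per_side(sensor_side, field_side, overlap)
    return per_side * per_side


if __name__ == "__main__":
    # Non-overlapping 3x3 fields on the 128x128 sensor.
    print(column_count(128, 3, 0))
    # 3x3 fields overlapping by one bit (stride of 2 bits).
    print(column_count(128, 3, 1))
    # For comparison: one column per sensor bit, i.e. skipping the SP.
    print(128 * 128)
```

With a field size of 3, the non-overlapping case gives 43 fields per side and
the one-bit overlap gives 64 per side, matching the figures in the text.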


Francisco



On 26.08.2013, at 20:10, Jeff Hawkins wrote:

> I am sold on the kid’s story idea.  I looked at the link below and there is a 
> lot of metadata in this file.  It would have to be removed before feeding it to 
> the CLA.
>  
> My assumption is that we would need a CLA with more columns than the standard 
> 2048.  How many bits are in your word fingerprints?  Could we make each bit a 
> column and skip the SP?
> Jeff
>  
> From: nupic [mailto:[email protected]] On Behalf Of Francisco 
> Webber
> Sent: Monday, August 26, 2013 3:50 AM
> To: NuPIC general mailing list.
> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>  
> Ian,
> I also thought about something from the Gutenberg repository.
> But I think we should start with something from the Kids Shelf.
>  
> There are several reasons in my opinion:
>  
> - We start experimentation with a full bag of unknown parameters, so keeping 
> the test material simple would allow us to detect the important ones sooner. 
> And it is quite some work to create a reliable evaluation framework, so the 
> size of the data set makes a difference.
> - Keeping the text simple and short substantially reduces the overall 
> vocabulary. If we want people to also evaluate offline, matching fingerprints 
> can become a lengthy process without an efficient similarity engine.
> - Another reason is the fact that we don't know how much information a given 
> set of columns (like the 2048 typically used) can absorb. In other 
> words: what is the optimal ratio between the size of the first layer of a 
> text-HTM and the amount of text?
> - Lastly, I believe that the sequence in which text is presented to the CLA is 
> of importance. After all, when humans learn by reading, they also 
> progress from simple to complex language. The amount of new vocabulary during 
> training should be relatively stable (the actual amount would probably be 
> linked to the ratio from my previous argument).
>  
> So we should build progressively more complex training data sets, finally 
> ending up with "true" books like the ones you listed.
>  
> To start I would suggest something like:
>  
> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
> http://www.gutenberg.org/ebooks/7841
>  
> But there might still be better ones…
>  
> Francisco
>  
>  
>  
> On 25.08.2013, at 23:05, Ian Danforth wrote:
> 
> 
> I will make 3 suggestions. All are out of copyright, well known, 
> uncontroversial, and still taught in schools (at least in the US).
>  
> 1. Robinson Crusoe - Daniel Defoe
>  
> http://www.gutenberg.org/ebooks/521
>  
> 2. Great Expectations - Charles Dickens
>  
> http://www.gutenberg.org/ebooks/1400
>  
> 3. The Time Machine - H.G. Wells
>  
> http://www.gutenberg.org/ebooks/35
>  
> Ian
>  
> 
> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
> For those who don't want to use the API and for evaluation purposes, I would 
> propose that we choose some reference text and I convert it into a sequence 
> of SDRs. This file could be used for training.
> I would also generate a list of all words contained in the text, together 
> with their SDRs, to be used as a conversion table.
> As a simple test measure we could feed a sequence of SDRs into a trained 
> network and see if the HTM makes the right prediction about the following 
> word(s). 
> The last file to produce for a complete framework would be a list of, let's 
> say, 100 word sequences with their correct continuation.
> The word sequences could be, for example, the beginnings of phrases with more 
> than n words (n being the number of steps ahead that the CLA can predict).
> This could be the beginning of a measuring set-up that allows us to compare 
> different CLA-implementation flavors.
>  
> Any suggestions for a text to choose?
>  
> Francisco
>  
> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>  
> Very cool, Francisco. Here is where you can get cept API credentials: 
> https://cept.3scale.net/signup
> 
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>  
> 
> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
> Just a short post scriptum:
> 
> The public version of our API doesn't actually contain the generic conversion 
> function. But if people from the HTM community want to experiment just click 
> the "Request for Beta-Program" button and I will upgrade your accounts 
> manually.
> 
> Francisco
> 
> On 24.08.2013, at 01:59, Francisco Webber wrote:
> 
> > Jeff,
> > I thought about this already.
> > We have a REST API where you can send a word in and get the SDR back, and 
> > vice versa.
> > I invite all who want to experiment to try it out.
> > You just need to get credentials at our website: www.cept.at.
> >
> > In the mid-term it would be cool to create some sort of evaluation set that 
> > could be used to measure progress while improving the CLA.
> >
> > We are continuously improving our Retina but the version that is currently 
> > online works pretty well already.
> >
> > I hope that will help
> >
> > Francisco
> >
> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
> >
> >> Francisco,
> >> Your work is very cool.  Do you think it would be possible to make
> >> available your word SDRs (or a sufficient subset of them) for
> >> experimentation?  I imagine there would be interest in the NuPIC community
> >> in training a CLA on text using your word SDRs.  You might get some useful
> >> results more quickly.  You could do this under a research-only license or
> >> something like that.
> >> Jeff
> >>
> >> -----Original Message-----
> >> From: nupic [mailto:[email protected]] On Behalf Of Francisco
> >> Webber
> >> Sent: Wednesday, August 21, 2013 1:01 PM
> >> To: NuPIC general mailing list.
> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
> >>
> >> Hello,
> >> I am one of the founders of CEPT Systems and lead researcher of our retina
> >> algorithm.
> >>
> >> We have developed a method to represent words by a bitmap pattern that
> >> captures most of their "lexical semantics" (a text sensor). Our word-SDRs
> >> fulfill all the requirements for "good" HTM input data.
> >>
> >> - Words with similar meaning "look" similar
> >> - If you drop random bits in the representation the semantics remain intact
> >> - Only a small number (up to 5%) of bits are set in a word-SDR
> >> - Every bit in the representation corresponds to a specific semantic
> >> feature of the language used
> >> - The Retina (sensory organ for an HTM) can be trained on any language
> >> - The Retina training process is fully unsupervised.
> >>
> >> We have found that the word-SDR by itself (without using any HTM yet)
> >> can improve many NLP problems that are only poorly solved using the
> >> traditional statistical approaches.
> >> We use the SDRs to:
> >> - Create fingerprints of text documents, which allows us to compare them for
> >> semantic similarity using simple (Euclidean) similarity measures
> >> - Automatically detect polysemy and disambiguate multiple meanings
> >> - Characterize any text with context terms for automatic
> >> search-engine query expansion.
> >>
> >> We hope to successfully link up our Retina to an HTM network to go beyond
> >> lexical semantics into the field of "grammatical semantics".
> >> This would hopefully lead to improved abstracting, conversation,
> >> question-answering, and translation systems.
> >>
> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
> >>
> >> I am interested in any form of cooperation to apply HTM technology to text.
> >>
> >> Francisco
> >>
> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
> >>
> >>>
> >>> Hello.
> >>>
> >>> Like many of you here, I am pretty new to HTM technology.
> >>>
> >>> I am a researcher in Brazil and I am going to start my PhD program soon.
> >> My field of interest is NLP and the extraction of knowledge from text. I am
> >> thinking of using the ideas behind the Memory Prediction Framework to
> >> investigate semantic information retrieval from the Web, and to answer
> >> questions in natural language. I intend to use the HTM implementation as
> >> a base to do this.
> >>>
> >>> I would appreciate it a lot if someone could answer some questions:
> >>>
> >>> - Is there any research related to HTM and NLP? Could you point me to it?
> >>>
> >>> - Is HTM suited to address this problem? Could it learn, without
> >> supervision, the grammar of a language, or just help in some aspects such as
> >> Named Entity Recognition?
> >>>
> >>>
> >>>
> >>> Regards,
> >>>
> >>> Christian
> >>>
> >>>
> >>> _______________________________________________
> >>> nupic mailing list
> >>> [email protected]
> >>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
