Jeff, I actually fully agree with your argument for why the SP is an important step. I was referring to some earlier discussions here on the list about skipping the SP when the data is already in SDR format. Feeding a 3x3 sensor patch into each column was meant as a way of using fewer columns than the retina resolution.
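The column counts quoted below for the 3x3 scheme (43x43 = 1849 without overlap, 64x64 = 4096 with a one-bit overlap) can be checked with a small sketch. The function name and the convention that the last patch may hang over the edge of the sensor array are assumptions made for illustration; nothing here is part of NuPIC or the CEPT API.

```python
import math


def columns_per_side(sensor_side: int, patch: int = 3, overlap: int = 0) -> int:
    """Count patch positions along one side of a square sensor array.

    Assumption: the last patch may hang over the edge (it is padded), which
    is the convention that reproduces the 43 and 64 figures in the thread.
    """
    stride = patch - overlap  # shift between neighbouring patches
    if stride <= 0:
        raise ValueError("overlap must be smaller than the patch size")
    return math.ceil(sensor_side / stride)


for overlap in (0, 1):
    side = columns_per_side(128, patch=3, overlap=overlap)
    print(f"overlap={overlap}: {side}x{side} = {side * side} columns")
# overlap=0: 43x43 = 1849 columns
# overlap=1: 64x64 = 4096 columns
```

Without padding, the non-overlapping case would cover only 42 full patches per side (126 of the 128 bits), so the edge convention matters for the exact count.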
Francisco

On 27.08.2013, at 21:09, "Jeff Hawkins" <[email protected]> wrote:

> I am not following your logic here. A column in the SP needs to be able to
> make connections to a large number of input bits, and then learn to make
> connections to a small subset of them. So I don’t understand the 3x3 comment.
>
> There are several reasons we might want to use a spatial pooler.
>
> 1) The SP converts the dimension (# bits) of an input into the dimension (#
> columns) of the SP. It does this dimension change in an elegant way that
> always does a pretty good job.
> 2) The SP converts an input of any sparsity into a relatively fixed sparsity
> (so the TP can work well). The percentage of active input bits can vary (all
> being somewhat sparse) and the SP will make it fixed.
> 3) The SP learns what bits in the input are useful for spatial correlations.
> It forms connections to these bits and those that don’t correlate are not
> used. In general the SP forms columns that represent commonly seen patterns
> in the input and this biases the representations passed to the TP. This
> comes at the expense of rarely seen patterns.
>
> All three reasons are valid for using the SP with CEPT’s word SDRs. However,
> if we made the number of columns match the number of input bits and we could
> force the word SDRs to each have the same number of active bits, then we
> could at least try skipping the SP. But it may not be worth it.
> Jeff
>
> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
> Sent: Monday, August 26, 2013 11:39 AM
> To: NuPIC general mailing list.
> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>
> Jeff,
> I am still not completely convinced that skipping the SP is a good thing to
> do.
> It is true that when you feed scalars into the system the SP acts like an
> SDRizer, but in the case of the text-retina we already get SDRs in the first
> place.
> I believe that in this case, the SP learns another aspect of the data,
> namely the semantic topology of the input pattern. This leads me to a scheme
> where each column gets a field of, let's say, 9 input bits arranged as a 3x3
> grid.
> Depending on the amount of memory one can spend, these 3x3 bits could be fed
> in a non-overlapping mode. This would mean that the 128x128 sensor bits need
> an array of 43x43 = 1849 columns.
> If we decided to overlap the 3x3 fields by one bit, the 128x128 sensor
> array would be mapped to 64x64 = 4096 columns.
>
> Francisco
>
> On 26.08.2013, at 20:10, Jeff Hawkins wrote:
>
> I am sold on the kid’s story idea. I looked at the link below and there is a
> lot of metadata in this file. It would have to be removed before feeding to
> the CLA.
>
> My assumption is that we would need a CLA with more columns than the standard
> 2048. How many bits are in your word fingerprints? Could we make each bit a
> column and skip the SP?
> Jeff
>
> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
> Sent: Monday, August 26, 2013 3:50 AM
> To: NuPIC general mailing list.
> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>
> Ian,
> I also thought about something from the Gutenberg repository.
> But I think we should start with something from the Kids Shelf.
>
> There are several reasons in my opinion:
>
> - We start experimentation with a full bag of unknown parameters, so keeping
> the test material simple would allow us to detect the important ones sooner.
> And it is quite some work to create a reliable evaluation framework, so the
> size of the data set makes a difference.
> - Keeping the text simple and short substantially reduces the overall
> vocabulary. If we want people to also evaluate offline, matching fingerprints
> can become a lengthy process without an efficient similarity engine.
> - Another reason is the fact that we don't know how much information a given
> set of columns (like the 2048 typically used) can absorb. In other
> words: what is the optimal ratio between the first layer of a text-HTM and the
> amount of text?
> - Lastly, I believe that the sequence in which text is presented to the CLA is
> of importance. After all, when humans learn information by reading, they also
> start from simple and move to complex language. The amount of new vocabulary
> during training should be relatively stable (the actual amount would probably
> be linked to the ratio in my previous argument).
>
> So we should build progressively more complex training data sets, finally
> ending up with "true" books like the ones you listed.
>
> To start I would suggest something like:
>
> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
> http://www.gutenberg.org/ebooks/7841
>
> But there might still be better ones…
>
> Francisco
>
> On 25.08.2013, at 23:05, Ian Danforth wrote:
>
> I will make 3 suggestions. All are out of copyright, well known,
> uncontroversial, and still taught in schools (at least in the US).
>
> 1. Robinson Crusoe - Daniel Defoe
>
> http://www.gutenberg.org/ebooks/521
>
> 2. Great Expectations - Charles Dickens
>
> http://www.gutenberg.org/ebooks/1400
>
> 3. The Time Machine - H.G. Wells
>
> http://www.gutenberg.org/ebooks/35
>
> Ian
>
> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
> For those who don't want to use the API and for evaluation purposes, I would
> propose that we choose some reference text and I convert it into a sequence
> of SDRs. This file could be used for training.
> I would also generate a list of all words contained in the text, together
> with their SDRs, to be used as a conversion table.
> As a simple test measure we could feed a sequence of SDRs into a trained
> network and see if the HTM makes the right prediction about the following
> word(s).
> The last file to produce for a complete framework would be a list of, let's
> say, 100 word sequences with their correct continuation.
> The word sequences could be, for example, the beginnings of phrases with more
> than n words (n being the number of steps that the CLA can predict
> ahead).
> This could be the beginning of a measuring set-up that allows us to compare
> different CLA-implementation flavors.
>
> Any suggestions for a text to choose?
>
> Francisco
>
> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>
> Very cool, Francisco. Here is where you can get cept API credentials:
> https://cept.3scale.net/signup
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
> Just a short postscript:
>
> The public version of our API doesn't actually contain the generic conversion
> function. But if people from the HTM community want to experiment, just click
> the "Request for Beta-Program" button and I will upgrade your accounts
> manually.
>
> Francisco
>
> On 24.08.2013, at 01:59, Francisco Webber wrote:
>
> > Jeff,
> > I thought about this already.
> > We have a REST API where you can send a word in and get the SDR back, and
> > vice versa.
> > I invite all who want to experiment to try it out.
> > You just need to get credentials at our website: www.cept.at.
> >
> > In the mid-term it would be cool to create some sort of evaluation set that
> > could be used to measure progress while improving the CLA.
> >
> > We are continuously improving our Retina but the version that is currently
> > online works pretty well already.
> >
> > I hope that will help
> >
> > Francisco
> >
> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
> >
> >> Francisco,
> >> Your work is very cool. Do you think it would be possible to make
> >> available your word SDRs (or a sufficient subset of them) for
> >> experimentation?
> >> I imagine there would be interest in the NuPIC community in training a CLA
> >> on text using your word SDRs. You might get some useful results more
> >> quickly. You could do this under a research-only license or something like
> >> that.
> >> Jeff
> >>
> >> -----Original Message-----
> >> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
> >> Sent: Wednesday, August 21, 2013 1:01 PM
> >> To: NuPIC general mailing list.
> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
> >>
> >> Hello,
> >> I am one of the founders of CEPT Systems and lead researcher of our retina
> >> algorithm.
> >>
> >> We have developed a method to represent words by a bitmap pattern capturing
> >> most of their "lexical semantics" (a text sensor). Our word-SDRs fulfill all
> >> the requirements for "good" HTM input data.
> >>
> >> - Words with similar meaning "look" similar
> >> - If you drop random bits in the representation the semantics remain intact
> >> - Only a small number (up to 5%) of bits are set in a word-SDR
> >> - Every bit in the representation corresponds to a specific semantic
> >> feature of the language used
> >> - The Retina (sensory organ for an HTM) can be trained on any language
> >> - The retina training process is fully unsupervised.
> >>
> >> We have found out that the word-SDR by itself (without using any HTM yet)
> >> can improve many NLP problems that are only poorly solved using the
> >> traditional statistical approaches.
> >> We use the SDRs to:
> >> - Create fingerprints of text documents, which allows us to compare them for
> >> semantic similarity using simple (Euclidean) similarity measures
> >> - Automatically detect polysemy and disambiguate multiple meanings
> >> - Characterize any text with context terms for automatic
> >> search-engine query expansion.
> >>
> >> We hope to successfully link up our Retina to an HTM network to go beyond
> >> lexical semantics into the field of "grammatical semantics".
> >> This would hopefully lead to improved abstracting, conversation,
> >> question-answering and translation systems.
> >>
> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
> >>
> >> I am interested in any form of cooperation to apply HTM technology to text.
> >>
> >> Francisco
> >>
> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
> >>
> >>> Hello.
> >>>
> >>> Like many of you here, I am pretty new to HTM technology.
> >>>
> >>> I am a researcher in Brazil and I am going to start my PhD program soon.
> >>> My field of interest is NLP and the extraction of knowledge from text. I am
> >>> thinking of using the ideas behind the Memory Prediction Framework to
> >>> investigate semantic information retrieval from the Web, and to answer
> >>> questions in natural language. I intend to use the HTM implementation as
> >>> a base to do this.
> >>>
> >>> I would appreciate it a lot if someone could answer some questions:
> >>>
> >>> - Is there any research related to HTM and NLP? Could you point me to it?
> >>>
> >>> - Is HTM suitable for addressing this problem? Could it learn, without
> >>> supervision, the grammar of a language, or just help in some aspects such
> >>> as Named Entity Recognition?
> >>>
> >>> Regards,
> >>>
> >>> Christian
> >>>
> >>> _______________________________________________
> >>> nupic mailing list
> >>> [email protected]
> >>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
