I've removed the metadata, the vocab lists and the illustrations: https://gist.github.com/jtauber/6347309
James

On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:

> I am sold on the kids' story idea. I looked at the link below and there
> is a lot of metadata in this file. It would have to be removed before
> feeding it to the CLA.
>
> My assumption is that we would need a CLA with more columns than the
> standard 2048. How many bits are in your word fingerprints? Could we
> make each bit a column and skip the SP?
>
> Jeff
>
> *From:* nupic [mailto:[email protected]] *On Behalf Of* Francisco Webber
> *Sent:* Monday, August 26, 2013 3:50 AM
> *To:* NuPIC general mailing list.
> *Subject:* Re: [nupic-dev] HTM in Natural Language Processing
>
> Ian,
>
> I also thought about something from the Gutenberg repository.
> But I think we should start with something from the Kids Shelf.
>
> There are several reasons, in my opinion:
>
> - We start experimentation with a full bag of unknown parameters, so
> keeping the test material simple would allow us to detect the important
> ones sooner. It is also quite some work to create a reliable evaluation
> framework, so the size of the data set makes a difference.
>
> - Keeping the text simple and short substantially reduces the overall
> vocabulary. If we want people to also evaluate offline, matching
> fingerprints can become a lengthy process without an efficient
> similarity engine.
>
> - Another reason is that we don't know how much information a given set
> of columns (like the 2048 typically used) can absorb. In other words:
> what is the optimal ratio between the first layer of a text-HTM and the
> amount of text?
>
> - Lastly, I believe that the sequence in which text is presented to the
> CLA matters. After all, when humans learn by reading, they also progress
> from simple to complex language.
> The amount of new vocabulary during training should be relatively
> stable (the actual amount would probably be linked to the ratio from my
> previous argument).
>
> So we should build increasingly complex training data sets, finally
> ending up with "true" books like the ones you listed.
>
> To start, I would suggest something like:
>
> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
> http://www.gutenberg.org/ebooks/7841
>
> But there might still be better ones…
>
> Francisco
>
> On 25.08.2013, at 23:05, Ian Danforth wrote:
>
> I will make 3 suggestions. All are out of copyright, well known,
> uncontroversial, and still taught in schools (at least in the US):
>
> 1. Robinson Crusoe - Daniel Defoe
> http://www.gutenberg.org/ebooks/521
>
> 2. Great Expectations - Charles Dickens
> http://www.gutenberg.org/ebooks/1400
>
> 3. The Time Machine - H.G. Wells
> http://www.gutenberg.org/ebooks/35
>
> Ian
>
> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
>
> For those who don't want to use the API, and for evaluation purposes, I
> would propose that we choose some reference text and I convert it into
> a sequence of SDRs. This file could be used for training.
>
> I would also generate a list of all words contained in the text,
> together with their SDRs, to be used as a conversion table.
>
> As a simple test measure, we could feed a sequence of SDRs into a
> trained network and see if the HTM makes the right prediction about the
> following word(s).
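The evaluation loop Francisco proposes (word → SDR via a conversion table, then scoring the network's next-word prediction) could look roughly like this in Python. The tiny conversion table and the bigram "predictor" are made-up stand-ins, not CEPT's actual word-SDRs or a trained CLA:

```python
# Sketch of the proposed evaluation: look up each word's SDR in a
# conversion table, ask the trained model for its predicted next SDR,
# and score it against the SDR of the word that actually follows.

def evaluate(words, sdr_table, predict, threshold=0.5):
    """Fraction of transitions where the predicted SDR overlaps the
    next word's actual SDR in at least `threshold` of its bits."""
    hits = 0
    for prev, nxt in zip(words, words[1:]):
        predicted = predict(sdr_table[prev])  # model's predicted next SDR
        actual = sdr_table[nxt]
        if len(predicted & actual) >= threshold * len(actual):
            hits += 1
    return hits / (len(words) - 1)

# Toy conversion table: word -> set of active bit indices (sparse SDR).
sdr_table = {"the": {1, 5, 9}, "cat": {2, 5, 8}, "sat": {3, 6, 9}}

# Toy "trained" model that has memorized the bigrams the->cat and cat->sat.
transitions = {frozenset(sdr_table["the"]): sdr_table["cat"],
               frozenset(sdr_table["cat"]): sdr_table["sat"]}
predict = lambda sdr: transitions.get(frozenset(sdr), set())

print(evaluate(["the", "cat", "sat"], sdr_table, predict))  # 1.0
```

Any CLA implementation exposing a predicted-bits output could be dropped in as `predict`, which is what would make the set-up comparable across implementation flavors.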
> The last file to produce for a complete framework would be a list of,
> let's say, 100 word sequences with their correct continuations.
>
> The word sequences could be, for example, the beginnings of phrases
> with more than n words (n being the number of steps ahead that the CLA
> can predict).
>
> This could be the beginning of a measuring set-up that allows us to
> compare different CLA-implementation flavors.
>
> Any suggestions for a text to choose?
>
> Francisco
>
> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>
> Very cool, Francisco. Here is where you can get CEPT API credentials:
> https://cept.3scale.net/signup
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
>
> Just a short post scriptum:
>
> The public version of our API doesn't actually contain the generic
> conversion function. But if people from the HTM community want to
> experiment, just click the "Request for Beta-Program" button and I will
> upgrade your accounts manually.
>
> Francisco
>
> On 24.08.2013, at 01:59, Francisco Webber wrote:
>
> > Jeff,
> >
> > I thought about this already.
> > We have a REST API where you can send a word in and get the SDR back,
> > and vice versa.
> > I invite all who want to experiment to try it out.
> > You just need to get credentials at our website: www.cept.at.
> >
> > In the mid-term it would be cool to create some sort of evaluation
> > set that could be used to measure progress while improving the CLA.
> >
> > We are continuously improving our Retina, but the version that is
> > currently online already works pretty well.
> >
> > I hope that helps.
> >
> > Francisco
> >
> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
> >
> >> Francisco,
> >> Your work is very cool.
> >> Do you think it would be possible to make available your word SDRs
> >> (or a sufficient subset of them) for experimentation? I imagine
> >> there would be interest in the NuPIC community in training a CLA on
> >> text using your word SDRs. You might get some useful results more
> >> quickly. You could do this under a research-only license or
> >> something like that.
> >>
> >> Jeff
> >>
> >> -----Original Message-----
> >> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
> >> Sent: Wednesday, August 21, 2013 1:01 PM
> >> To: NuPIC general mailing list.
> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
> >>
> >> Hello,
> >>
> >> I am one of the founders of CEPT Systems and lead researcher of our
> >> retina algorithm.
> >>
> >> We have developed a method to represent words by a bitmap pattern
> >> capturing most of their "lexical semantics" (a text sensor). Our
> >> word-SDRs fulfill all the requirements for "good" HTM input data:
> >>
> >> - Words with similar meaning "look" similar.
> >> - If you drop random bits in the representation, the semantics
> >>   remain intact.
> >> - Only a small number (up to 5%) of bits are set in a word-SDR.
> >> - Every bit in the representation corresponds to a specific
> >>   semantic feature of the language used.
> >> - The Retina (the sensory organ for an HTM) can be trained on any
> >>   language.
> >> - The Retina training process is fully unsupervised.
> >>
> >> We have found that the word-SDRs by themselves (without using any
> >> HTM yet) can improve many NLP problems that are only poorly solved
> >> by traditional statistical approaches.
> >>
> >> We use the SDRs to:
> >> - Create fingerprints of text documents, which allows us to compare
> >>   them for semantic similarity using simple (Euclidean) similarity
> >>   measures.
> >> - Automatically detect polysemy and disambiguate multiple meanings.
> >> - Characterize any text with context terms for automatic
> >>   search-engine query expansion.
> >>
> >> We hope to successfully link up our Retina to an HTM network to go
> >> beyond lexical semantics into the field of "grammatical semantics".
> >> This would hopefully lead to improved abstracting, conversation,
> >> question-answering, and translation systems.
> >>
> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
> >>
> >> I am interested in any form of cooperation to apply HTM technology
> >> to text.
> >>
> >> Francisco
> >>
> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
> >>
> >>> Hello.
> >>>
> >>> Like many of you here, I am pretty new to HTM technology.
> >>>
> >>> I am a researcher in Brazil and I am going to start my PhD program
> >>> soon. My field of interest is NLP and the extraction of knowledge
> >>> from text. I am thinking of using the ideas behind the Memory
> >>> Prediction Framework to investigate semantic information retrieval
> >>> from the Web and answering questions in natural language. I intend
> >>> to use the HTM implementation as a base to do this.
> >>>
> >>> I would appreciate it a lot if someone could answer some questions:
> >>>
> >>> - Is there existing research related to HTM and NLP? Could you
> >>>   point me to it?
> >>>
> >>> - Is HTM suited to this problem? Could it learn, without
> >>>   supervision, the grammar of a language, or just help in some
> >>>   aspects such as Named Entity Recognition?
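The word-SDR properties Francisco describes in this thread (sparseness, similarity as bit overlap, robustness to dropped bits) are easy to illustrate with sets of active bit indices. The retina size, sparsity, and example "words" below are invented for illustration only, not CEPT's actual parameters:

```python
import random

# Illustration of the word-SDR properties discussed in the thread: an SDR
# as a sparse set of active bit indices, similarity as bit overlap, and
# robustness to randomly dropped bits. All numbers here are invented.

N = 16384        # hypothetical retina size (number of semantic features)
SPARSITY = 0.02  # "up to 5%" of bits set; 2% here

random.seed(42)

def random_sdr():
    return set(random.sample(range(N), int(N * SPARSITY)))

def overlap(a, b):
    """Shared active bits -- the basic semantic similarity measure."""
    return len(a & b)

def drop_bits(sdr, fraction=0.3):
    """Randomly discard a fraction of the active bits."""
    keep = int(len(sdr) * (1 - fraction))
    return set(random.sample(sorted(sdr), keep))

dog = random_sdr()
# A semantically similar word shares most of its active bits with "dog".
puppy = set(sorted(dog)[:250]) | set(random.sample(range(N), 77))
car = random_sdr()  # unrelated word: only chance-level overlap

assert overlap(dog, puppy) > overlap(dog, car)             # similar words "look" similar
assert overlap(drop_bits(dog), puppy) > overlap(dog, car)  # semantics survive dropped bits
```

For binary vectors, the Euclidean measure Francisco mentions for document fingerprints is interchangeable with overlap: the squared Euclidean distance between two bitmaps A and B equals |A| + |B| − 2·overlap(A, B).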
> >>> > >>> > >>> > >>> Regards, > >>> > >>> Christian > >>> > >>> > >>> _______________________________________________ > >>> nupic mailing list > >>> [email protected] > >>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org > >> > >> > >> _______________________________________________ > >> nupic mailing list > >> [email protected] > >> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org > >> > >> > >> _______________________________________________ > >> nupic mailing list > >> [email protected] > >> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org > > > > > > _______________________________________________ > > nupic mailing list > > [email protected] > > http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org > > > _______________________________________________ > nupic mailing list > [email protected] > http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org**** > > ** ** > > _______________________________________________ > nupic mailing list > [email protected] > http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org**** > > ** ** > > > _______________________________________________ > nupic mailing list > [email protected] > http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org**** > > ** ** > > _______________________________________________ > nupic mailing list > [email protected] > http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org**** > > ** ** > > _______________________________________________ > nupic mailing list > [email protected] > http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org > > -- James Tauber http://jtauber.com/ @jtauber on Twitter
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
