James, that's great!
I think some more preparation is necessary:
- All CRLFs should be removed, keeping one blank after each full stop. (This makes it easier for most parsers.)
- The line of asterisks should be replaced by a CRLF to mark the paragraphs. (We never know, but we could need paragraph info at some point.)
- The file as such should be split into single tales. (Whatever experiments we run, if we rerun them with different tales, results become more comparable.)
- The title should not be written in caps. (A capital letter followed by a full stop is interpreted as an acronym or middle name instead of a sentence delimiter.)
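
A minimal Python sketch of the four preparation steps listed above. It assumes, without this having been checked against the gist, that every tale starts with a title line written entirely in capitals and that paragraph separators are lines containing only asterisks and spaces. The names `is_title`, `is_asterisk_line`, `split_tales` and `render` are illustrative only and are not part of NuPIC or the cept API.

```python
import string


def is_title(line):
    """Assumed convention: a title is a line whose letters are all capitals."""
    letters = [c for c in line if c.isalpha()]
    return len(letters) > 1 and all(c.isupper() for c in letters)


def is_asterisk_line(line):
    """Assumed convention: a separator line holds only asterisks and spaces."""
    return "*" in line and set(line) <= {"*", " "}


def split_tales(raw):
    """Split the raw text into tales, each with a title and clean paragraphs."""
    tales = []    # list of {"title": str, "paragraphs": [str, ...]}
    buffer = []   # lines of the paragraph currently being collected

    def flush():
        # Join the buffered lines with single blanks so that no line break
        # survives inside a paragraph. Text before the first title is dropped.
        if buffer and tales:
            tales[-1]["paragraphs"].append(" ".join(" ".join(buffer).split()))
        buffer.clear()

    for line in raw.replace("\r\n", "\n").split("\n"):
        stripped = line.strip()
        if is_title(stripped):
            flush()
            # capwords turns "THE TITLE" into "The Title", so the title is
            # no longer written in caps.
            tales.append({"title": string.capwords(stripped), "paragraphs": []})
        elif is_asterisk_line(stripped):
            flush()   # a line of asterisks marks a paragraph boundary
        elif stripped:
            buffer.append(stripped)
        # blank lines are ignored: only asterisk lines separate paragraphs
    flush()
    return tales


def render(tale):
    """One tale as text: the title, then one line per paragraph."""
    return "\n".join([tale["title"]] + tale["paragraphs"])
```

Each tale returned by `split_tales` can be written to its own file with `render(tale)`. If the titles in the real file follow a different convention, only `is_title` would need to change.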

Francisco

On 27.08.2013 at 00:22, James Tauber <[email protected]> wrote:

> I've removed the metadata, the vocab lists and the illustrations:
>
> https://gist.github.com/jtauber/6347309
>
> James
>
>
> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
> I am sold on the kid’s story idea. I looked at the link below and there is a
> lot of metadata in this file. It would have to be removed before feeding to
> the CLA.
>
> My assumption is that we would need a CLA with more columns than the standard
> 2048. How many bits are in your word fingerprints? Could we make each bit a
> column and skip the SP?
>
> Jeff
>
> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
> Sent: Monday, August 26, 2013 3:50 AM
> To: NuPIC general mailing list.
> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>
> Ian,
>
> I also thought about something from the Gutenberg repository.
>
> But I think we should start with something from the Kids Shelf.
>
> There are several reasons in my opinion:
>
> - We start experimentation with a full bag of unknown parameters, so keeping
> the test material simple would allow us to detect the important ones sooner.
> And it is quite some work to create a reliable evaluation framework, so the
> size of the data set makes a difference.
>
> - Keeping the text simple and short substantially reduces the overall
> vocabulary. If we want people to also evaluate offline, matching fingerprints
> can become a lengthy process without an efficient similarity engine.
>
> - Another reason is the fact that we don't know how much information a given
> set of columns (like the 2048 typically used) can absorb. In other
> words: what is the optimal ratio between a first layer of a text-HTM and the
> amount of text?
>
> - Lastly I believe that the sequence in which text is presented to the CLA is
> of importance.
> After all, when humans learn information by reading, they also
> start from simple to complex language. The amount of new vocabulary during
> training should be relatively stable (the actual amount would probably be
> linked to the ratio of my previous argument).
>
> So we should build continuously more complex training data sets, finally
> ending up with "true" books like the ones you listed.
>
> To start I would suggest something like:
>
> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
>
> http://www.gutenberg.org/ebooks/7841
>
> But there might still be better ones…
>
> Francisco
>
>
> On 25.08.2013, at 23:05, Ian Danforth wrote:
>
> I will make 3 suggestions. All are out of copyright, well known,
> uncontroversial, and still taught in schools (at least in the US).
>
> 1. Robinson Crusoe - Daniel Defoe
>
> http://www.gutenberg.org/ebooks/521
>
> 2. Great Expectations - Charles Dickens
>
> http://www.gutenberg.org/ebooks/1400
>
> 3. The Time Machine - H.G. Wells
>
> http://www.gutenberg.org/ebooks/35
>
> Ian
>
> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
>
> For those who don't want to use the API and for evaluation purposes, I would
> propose that we choose some reference text and I convert it into a sequence
> of SDRs. This file could be used for training.
>
> I would also generate a list of all words contained in the text, together
> with their SDRs, to be used as a conversion table.
>
> As a simple test measure we could feed a sequence of SDRs into a trained
> network and see if the HTM makes the right prediction about the following
> word(s).
>
> The last file to produce for a complete framework would be a list of, let's say,
> 100 word sequences with their correct continuation.
>
> The word sequences could be, for example, the beginnings of phrases with more
> than n words (n being the number of steps that the CLA can predict
> ahead).
>
> This could be the beginning of a measuring set-up that allows us to compare
> different CLA-implementation flavors.
>
> Any suggestions for a text to choose?
>
> Francisco
>
> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>
> Very cool, Francisco. Here is where you can get cept API credentials:
> https://cept.3scale.net/signup
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
>
> Just a short postscript:
>
> The public version of our API doesn't actually contain the generic conversion
> function. But if people from the HTM community want to experiment, just click
> the "Request for Beta-Program" button and I will upgrade your accounts
> manually.
>
> Francisco
>
>
> On 24.08.2013, at 01:59, Francisco Webber wrote:
>
> > Jeff,
> > I thought about this already.
> > We have a REST API where you can send a word in and get the SDR back, and
> > vice versa.
> > I invite all who want to experiment to try it out.
> > You just need to get credentials at our website: www.cept.at.
> >
> > In the mid-term it would be cool to create some sort of evaluation set that
> > could be used to measure progress while improving the CLA.
> >
> > We are continuously improving our Retina but the version that is currently
> > online works pretty well already.
> >
> > I hope that will help
> >
> > Francisco
> >
> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
> >
> >> Francisco,
> >> Your work is very cool. Do you think it would be possible to make available
> >> your word SDRs (or a sufficient subset of them) for experimentation? I
> >> imagine there would be interest in the NuPIC community in training a CLA
> >> on text using your word SDRs. You might get some useful results more
> >> quickly.
> >> You could do this under a research-only license or something like
> >> that.
> >> Jeff
> >>
> >> -----Original Message-----
> >> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
> >> Sent: Wednesday, August 21, 2013 1:01 PM
> >> To: NuPIC general mailing list.
> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
> >>
> >> Hello,
> >> I am one of the founders of CEPT Systems and lead researcher of our retina
> >> algorithm.
> >>
> >> We have developed a method to represent words by a bitmap pattern capturing
> >> most of their "lexical semantics" (a text sensor). Our word-SDRs fulfill all
> >> the requirements for "good" HTM input data.
> >>
> >> - Words with similar meaning "look" similar
> >> - If you drop random bits in the representation the semantics remain intact
> >> - Only a small number (up to 5%) of bits are set in a word-SDR
> >> - Every bit in the representation corresponds to a specific semantic feature
> >> of the language used
> >> - The Retina (sensory organ for an HTM) can be trained on any language
> >> - The retina training process is fully unsupervised.
> >>
> >> We have found out that the word-SDR by itself (without using any HTM yet)
> >> can improve many NLP problems that are only poorly solved using the
> >> traditional statistical approaches.
> >> We use the SDRs to:
> >> - Create fingerprints of text documents, which allows us to compare them for
> >> semantic similarity using simple (Euclidean) similarity measures
> >> - Automatically detect polysemy and disambiguate multiple meanings.
> >> - Characterize any text with context terms for automatic
> >> search-engine query expansion.
> >>
> >> We hope to successfully link up our Retina to an HTM network to go beyond
> >> lexical semantics into the field of "grammatical semantics".
> >> This would hopefully lead to improved abstracting, conversation, question
> >> answering and translation systems.
> >>
> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
> >>
> >> I am interested in any form of cooperation to apply HTM technology to text.
> >>
> >> Francisco
> >>
> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
> >>
> >>> Hello.
> >>>
> >>> Like many of you here, I am pretty new to HTM technology.
> >>>
> >>> I am a researcher in Brazil and I am going to start my PhD program soon.
> >>> My field of interest is NLP and the extraction of knowledge from text. I am
> >>> thinking of using the ideas behind the Memory Prediction Framework to
> >>> investigate semantic information retrieval from the Web, and to answer
> >>> questions in natural language. I intend to use the HTM implementation as
> >>> the basis for this.
> >>>
> >>> I would appreciate it a lot if someone could answer some questions:
> >>>
> >>> - Is there any research related to HTM and NLP? Could you point me to it?
> >>>
> >>> - Is HTM suitable for addressing this problem? Could it learn, without
> >>> supervision, the grammar of a language, or just help in some aspects such as
> >>> Named Entity Recognition?
> >>>
> >>> Regards,
> >>>
> >>> Christian
> >>>
> >>> _______________________________________________
> >>> nupic mailing list
> >>> [email protected]
> >>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>
> --
> James Tauber
> http://jtauber.com/
> @jtauber on Twitter
_______________________________________________ nupic mailing list [email protected] http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
