Let me know if this is what you had in mind (just the ugly duckling): https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
I put each paragraph on its own line and separated the sections (which were formerly separated by a row of asterisks) with a blank line.

James

On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber <[email protected]> wrote:
> James,
> that's great!
> I think there is some more preparation necessary:
> - All CRLFs should be removed, keeping one blank after each full stop. (This makes it easier for most parsers.)
> - The line of asterisks should be replaced by a CRLF to mark the paragraphs. (We never know, but we could need paragraph info at some point.)
> - The file as such should be split into single tales. (Whatever experiments we run, if we rerun them with different tales, the results become more comparable.)
> - The title should not be written in caps. (A capital letter plus a full stop is interpreted as an acronym or a middle name instead of a sentence delimiter.)
>
> Francisco
>
> On 27.08.2013 at 00:22, James Tauber <[email protected]> wrote:
>
> I've removed the metadata, the vocab lists and the illustrations:
>
> https://gist.github.com/jtauber/6347309
>
> James
>
> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
>> I am sold on the kids' story idea. I looked at the link below and there is a lot of metadata in this file. It would have to be removed before feeding it to the CLA.
>>
>> My assumption is that we would need a CLA with more columns than the standard 2048. How many bits are in your word fingerprints? Could we make each bit a column and skip the SP?
>>
>> Jeff
>>
>> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>> Sent: Monday, August 26, 2013 3:50 AM
>> To: NuPIC general mailing list.
>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>
>> Ian,
>>
>> I also thought about something from the Gutenberg repository.
>> But I think we should start with something from the kids' shelf.
>>
>> There are several reasons, in my opinion:
>>
>> - We start experimentation with a full bag of unknown parameters, so keeping the test material simple would allow us to detect the important ones sooner. And it is quite some work to create a reliable evaluation framework, so the size of the data set makes a difference.
>> - Keeping the text simple and short substantially reduces the overall vocabulary. If we want people to also evaluate offline, matching fingerprints can become a lengthy process without an efficient similarity engine.
>> - Another reason is that we don't know how much information a given set of columns (like the 2048 typically used) can absorb. In other words: what is the optimal ratio between the first layer of a text-HTM and the amount of text?
>> - Lastly, I believe that the sequence in which text is presented to the CLA is important. After all, when humans learn information by reading, they also start from simple to complex language.
>> The amount of new vocabulary during training should be relatively stable (the actual amount would probably be linked to the ratio of my previous argument).
>>
>> So we should build continuously more complex training data sets, finally ending up with "true" books like the ones you listed.
>>
>> To start, I would suggest something like:
>>
>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
>> http://www.gutenberg.org/ebooks/7841
>>
>> But there might still be better ones…
>>
>> Francisco
>>
>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>>
>> I will make three suggestions. All are out of copyright, well known, uncontroversial, and still taught in schools (at least in the US):
>>
>> 1. Robinson Crusoe - Daniel Defoe
>> http://www.gutenberg.org/ebooks/521
>>
>> 2. Great Expectations - Charles Dickens
>> http://www.gutenberg.org/ebooks/1400
>>
>> 3. The Time Machine - H.G. Wells
>> http://www.gutenberg.org/ebooks/35
>>
>> Ian
>>
>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
>> For those who don't want to use the API, and for evaluation purposes, I would propose that we choose some reference text and I convert it into a sequence of SDRs. This file could be used for training.
>> I would also generate a list of all words contained in the text, together with their SDRs, to be used as a conversion table.
>> As a simple test measure we could feed a sequence of SDRs into a trained network and see if the HTM makes the right prediction about the following word(s).
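The test measure Francisco proposes (feed a sequence of word-SDRs into a trained network and check whether the prediction matches the actual next word) could be sketched roughly as below. This is only a sketch under stated assumptions: `predict_next` stands in for whatever prediction call a given CLA implementation actually exposes, and the overlap-based nearest-word lookup is an assumption, not part of the proposal.

```python
# Sketch of the proposed evaluation loop (hypothetical API).
# `model.predict_next(sdr)` stands in for the trained CLA's prediction
# call; `vocab` maps each word to its SDR (the proposed conversion table).

def overlap(a: set, b: set) -> int:
    """Similarity between two SDRs, given as sets of active bit indices."""
    return len(a & b)

def nearest_word(predicted: set, vocab: dict) -> str:
    """Find the vocabulary word whose SDR best matches the predicted SDR."""
    return max(vocab, key=lambda w: overlap(predicted, vocab[w]))

def score(model, vocab: dict, sequences: list) -> float:
    """Fraction of test items whose continuation the model predicts.

    Each test item is (list_of_prefix_words, correct_next_word).
    """
    correct = 0
    for words, continuation in sequences:
        for w in words:  # feed the prefix into the network, one SDR at a time
            predicted = model.predict_next(vocab[w])
        if nearest_word(predicted, vocab) == continuation:
            correct += 1
    return correct / len(sequences)
```

Representing an SDR as the set of its active bit indices keeps the comparison cheap even without an efficient similarity engine, which matters for the offline evaluation Francisco mentions.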
>> The last file to produce for a complete framework would be a list of, let's say, 100 word sequences with their correct continuations.
>> The word sequences could be, for example, the beginnings of phrases with more than n words (n being the number of steps ahead that the CLA can predict).
>> This could be the beginning of a measuring set-up that allows us to compare different CLA-implementation flavors.
>>
>> Any suggestions for a text to choose?
>>
>> Francisco
>>
>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>>
>> Very cool, Francisco. Here is where you can get CEPT API credentials:
>> https://cept.3scale.net/signup
>>
>> ---------
>> Matt Taylor
>> OS Community Flag-Bearer
>> Numenta
>>
>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
>> Just a short post scriptum:
>> The public version of our API doesn't actually contain the generic conversion function. But if people from the HTM community want to experiment, just click the "Request for Beta-Program" button and I will upgrade your accounts manually.
>>
>> Francisco
>>
>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>>
>> > Jeff,
>> > I thought about this already.
>> > We have a REST API where you can send a word in and get the SDR back, and vice versa.
>> > I invite all who want to experiment to try it out.
>> > You just need to get credentials at our website: www.cept.at.
>> >
>> > In the mid-term it would be cool to create some sort of evaluation set that could be used to measure progress while improving the CLA.
>> >
>> > We are continuously improving our Retina, but the version that is currently online works pretty well already.
>> >
>> > I hope that helps.
>> >
>> > Francisco
>> >
>> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>> >
>> >> Francisco,
>> >> Your work is very cool.
>> >> Do you think it would be possible to make your word SDRs (or a sufficient subset of them) available for experimentation? I imagine there would be interest in the NuPIC community in training a CLA on text using your word SDRs. You might get some useful results more quickly. You could do this under a research-only license or something like that.
>> >> Jeff
>> >>
>> >> -----Original Message-----
>> >> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>> >> Sent: Wednesday, August 21, 2013 1:01 PM
>> >> To: NuPIC general mailing list.
>> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>> >>
>> >> Hello,
>> >> I am one of the founders of CEPT Systems and the lead researcher of our Retina algorithm.
>> >>
>> >> We have developed a method to represent words by a bitmap pattern capturing most of their "lexical semantics" (a text sensor). Our word-SDRs fulfill all the requirements for "good" HTM input data:
>> >>
>> >> - Words with similar meaning "look" similar.
>> >> - If you drop random bits in the representation, the semantics remain intact.
>> >> - Only a small number (up to 5%) of bits are set in a word-SDR.
>> >> - Every bit in the representation corresponds to a specific semantic feature of the language used.
>> >> - The Retina (a sensory organ for an HTM) can be trained on any language.
>> >> - The Retina training process is fully unsupervised.
>> >>
>> >> We have found that the word-SDRs by themselves (without using any HTM yet) can improve many NLP problems that are only poorly solved by the traditional statistical approaches.
>> >> We use the SDRs to:
>> >> - Create fingerprints of text documents, which allows us to compare them for semantic similarity using simple (Euclidean) similarity measures.
>> >> - Automatically detect polysemy and disambiguate multiple meanings.
>> >> - Characterize any text with context terms for automatic search-engine query expansion.
>> >>
>> >> We hope to successfully link up our Retina to an HTM network to go beyond lexical semantics into the field of "grammatical semantics".
>> >> This would hopefully lead to improved abstracting, conversation, question-answering and translation systems.
>> >>
>> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>> >>
>> >> I am interested in any form of cooperation to apply HTM technology to text.
>> >>
>> >> Francisco
>> >>
>> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>> >>
>> >>> Hello.
>> >>>
>> >>> Like many of you here, I am pretty new to HTM technology.
>> >>>
>> >>> I am a researcher in Brazil and I am going to start my PhD program soon. My field of interest is NLP and the extraction of knowledge from text. I am thinking of using the ideas behind the Memory Prediction Framework to investigate semantic information retrieval from the Web, and answering questions in natural language. I intend to use the HTM implementation as a base to do this.
>> >>>
>> >>> I would appreciate it a lot if someone could answer some questions:
>> >>>
>> >>> - Is there any research related to HTM and NLP? Could you point me to it?
>> >>> - Is HTM suited to this problem? Could it learn, without supervision, the grammar of a language, or just help in some aspects such as Named Entity Recognition?
>> >>>
>> >>> Regards,
>> >>>
>> >>> Christian
>> >>>
>> >>> _______________________________________________
>> >>> nupic mailing list
>> >>> [email protected]
>> >>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org

--
James Tauber
http://jtauber.com/
@jtauber on Twitter
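The word-SDR properties Francisco lists earlier in the thread (sparse bitmaps with up to 5% of bits set, similar meanings looking similar, and robustness to dropped bits) can be illustrated offline with a small sketch. The bit width, the 2% sparsity, and the overlap measure below are illustrative assumptions, not the CEPT Retina's actual format.

```python
import random

# Illustrative only: the real Retina bit width is not specified in the
# thread; 16384 bits with ~2% active is an assumed example.
N_BITS = 16384

def random_sdr(n_active: int = 328, seed: int = 0) -> frozenset:
    """A toy word-SDR: a small set of active bit indices (~2% of N_BITS)."""
    rng = random.Random(seed)
    return frozenset(rng.sample(range(N_BITS), n_active))

def similarity(a: frozenset, b: frozenset) -> float:
    """Overlap similarity: fraction of shared active bits."""
    return len(a & b) / max(len(a), len(b))

def drop_bits(sdr: frozenset, fraction: float, seed: int = 1) -> frozenset:
    """Simulate noise by dropping a random fraction of the active bits."""
    rng = random.Random(seed)
    keep = rng.sample(sorted(sdr), int(len(sdr) * (1 - fraction)))
    return frozenset(keep)

# Dropping 20% of the active bits still leaves the representation highly
# similar to the original -- the robustness property Francisco mentions.
word = random_sdr(seed=42)
noisy = drop_bits(word, 0.2)
```

Because only a few hundred of the thousands of bits are active, two unrelated random SDRs share almost no bits, while a degraded copy of a word still overlaps heavily with the original; this is why simple similarity measures work at all.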
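Francisco's preparation checklist near the top of the thread (remove CRLFs within paragraphs, turn asterisk rows into paragraph markers, avoid all-caps titles that confuse sentence splitters) could be sketched as a small normalization pass. The exact rules below are a guess at a parser-friendly layout, not an agreed format.

```python
import re

def normalize(raw: str) -> str:
    """Normalization pass along the lines Francisco suggests:
    - rows of asterisks become paragraph breaks,
    - line breaks inside a paragraph are joined into single spaces,
    - ALL-CAPS lines (e.g. titles) are converted to title case so that
      'capital letter + full stop' is not mistaken for an acronym.
    """
    # Asterisk separator rows -> empty line (paragraph marker).
    text = re.sub(r"^\s*\*+\s*$", "", raw, flags=re.MULTILINE)
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Join the CRLF-wrapped lines of one paragraph into one line.
        line = " ".join(block.split())
        if not line:
            continue
        if line.isupper():  # likely a title written in caps
            line = line.title()
        paragraphs.append(line)
    return "\n\n".join(paragraphs)
```

Splitting the normalized file into single tales is left out here, since it depends on how each Gutenberg file marks story boundaries.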
