yep, I'm working on it :-)
On Tue, Aug 27, 2013 at 12:29 PM, Francisco Webber <[email protected]> wrote:

> Yes, James, that looks perfect. Great job!
> Now we need the other tales in the same format.
>
> Francisco
>
> On 27.08.2013, at 15:14, James Tauber wrote:
>
> Let me know if this is what you had in mind (just the ugly duckling):
>
> https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
>
> I put each paragraph on its own line and separated the sections (which
> were formerly separated by a row of asterisks) with a blank line.
>
> James
>
> On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber <[email protected]> wrote:
>
>> James,
>> that's great! I think some more preparation is necessary:
>>
>> - All CRLFs should be removed, keeping one space after each full stop.
>>   (This makes it easier for most parsers.)
>> - The line of asterisks should be replaced by a CRLF to mark the
>>   paragraphs. (We never know; we might need the paragraph info at some
>>   point.)
>> - The file as such should be split into single tales. (Whatever
>>   experiments we run, if we rerun them with different tales, the results
>>   become more comparable.)
>> - The title should not be written in caps. (A capital letter followed by
>>   a full stop is interpreted as an acronym or a middle initial instead
>>   of a sentence delimiter.)
>>
>> Francisco
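A minimal sketch of the preparation steps listed above (Python; the file
names and the title-case heuristic are illustrative assumptions, not
something specified in the thread):

import re

# Placeholder input: the raw Project Gutenberg text of one tale.
with open("primary_reader.txt") as f:
    raw = f.read().replace("\r\n", "\n")   # drop CRLFs up front

# Mark each row of asterisks as a section break before unwrapping lines.
raw = re.sub(r"(?m)^[ \t]*\*[ \t*]*$", "<SECTION>", raw)

out = []
for block in raw.split("\n\n"):            # blank line = paragraph break
    block = " ".join(block.split())        # unwrap hard line breaks, one
    if not block:                          # space left after each full stop
        continue
    if block == "<SECTION>":
        out.append("")                     # blank line between sections
    elif block.isupper():                  # all-caps title -> title case
        out.append(block.title())
    else:
        out.append(block)

# One paragraph per line, sections separated by a blank line.
with open("primary_reader_clean.txt", "w") as f:
    f.write("\n".join(out) + "\n")

Splitting the cleaned file into single tales is left out here; where one
tale ends in that collection is easiest to decide by hand.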
>> On 27.08.2013 at 00:22, James Tauber <[email protected]> wrote:
>>
>> I've removed the metadata, the vocab lists and the illustrations:
>>
>> https://gist.github.com/jtauber/6347309
>>
>> James
>>
>> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
>>
>>> I am sold on the kids' story idea. I looked at the link below, and
>>> there is a lot of metadata in this file. It would have to be removed
>>> before feeding it to the CLA.
>>>
>>> My assumption is that we would need a CLA with more columns than the
>>> standard 2048. How many bits are in your word fingerprints? Could we
>>> make each bit a column and skip the SP?
>>>
>>> Jeff
>>>
>>> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>> Sent: Monday, August 26, 2013 3:50 AM
>>> To: NuPIC general mailing list.
>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>
>>> Ian,
>>> I also thought about something from the Gutenberg repository, but I
>>> think we should start with something from the kids' shelf.
>>>
>>> There are several reasons, in my opinion:
>>>
>>> - We start experimentation with a full bag of unknown parameters, so
>>>   keeping the test material simple would let us detect the important
>>>   ones sooner. It is also quite some work to create a reliable
>>>   evaluation framework, so the size of the data set makes a difference.
>>> - Keeping the text simple and short substantially reduces the overall
>>>   vocabulary. If we want people to also evaluate offline, matching
>>>   fingerprints can become a lengthy process without an efficient
>>>   similarity engine.
>>> - Another reason is that we don't know how much information a given set
>>>   of columns (like the 2048 typically used) can absorb. In other words:
>>>   what is the optimal ratio between the first layer of a text-HTM and
>>>   the amount of text?
>>> - Lastly, I believe that the sequence in which text is presented to the
>>>   CLA matters. After all, when humans learn information by reading,
>>>   they also progress from simple to complex language. The amount of new
>>>   vocabulary during training should be relatively stable (the actual
>>>   amount would probably be linked to the ratio in my previous argument).
>>>
>>> So we should build progressively more complex training data sets,
>>> finally ending up with "true" books like the ones you listed.
>>>
>>> To start, I would suggest something like:
>>>
>>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
>>> http://www.gutenberg.org/ebooks/7841
>>>
>>> But there might still be better ones…
>>>
>>> Francisco
>>>
>>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>>>
>>> I will make three suggestions. All are out of copyright, well known,
>>> uncontroversial, and still taught in schools (at least in the US):
>>>
>>> 1. Robinson Crusoe - Daniel Defoe
>>>    http://www.gutenberg.org/ebooks/521
>>>
>>> 2. Great Expectations - Charles Dickens
>>>    http://www.gutenberg.org/ebooks/1400
>>>
>>> 3. The Time Machine - H.G. Wells
>>>    http://www.gutenberg.org/ebooks/35
>>>
>>> Ian
>>>
>>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
>>>
>>> For those who don't want to use the API, and for evaluation purposes,
>>> I would propose that we choose some reference text and I convert it
>>> into a sequence of SDRs. This file could be used for training.
>>>
>>> I would also generate a list of all words contained in the text,
>>> together with their SDRs, to be used as a conversion table.
>>>
>>> As a simple test measure, we could feed a sequence of SDRs into a
>>> trained network and see if the HTM makes the right prediction about
>>> the following word(s). The last file to produce for a complete
>>> framework would be a list of, let's say, 100 word sequences with their
>>> correct continuations. The word sequences could be, for example, the
>>> beginnings of phrases with more than n words (n being the number of
>>> steps ahead that the CLA can predict).
>>>
>>> This could be the beginning of a measuring setup that allows us to
>>> compare different CLA implementation flavors.
>>>
>>> Any suggestions for a text to choose?
>>>
>>> Francisco
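A sketch of that scoring setup (Python; `predict_sdr` stands in for
whatever interface a trained CLA implementation exposes, and `word_sdrs`
is the word-to-SDR conversion table with each SDR as a set of active bit
indices — all names here are illustrative assumptions):

def best_match(predicted_bits, word_sdrs):
    """Return the word whose SDR overlaps most with the predicted bits."""
    return max(word_sdrs, key=lambda w: len(word_sdrs[w] & predicted_bits))

def score(test_set, word_sdrs, predict_sdr):
    """test_set: list of (prefix_words, correct_next_word) pairs, e.g.
    the ~100 phrase beginnings with their correct continuations."""
    hits = 0
    for prefix, expected in test_set:
        # Feed the prefix SDRs to the implementation under test and take
        # its prediction for the next input.
        predicted_bits = predict_sdr([word_sdrs[w] for w in prefix])
        if best_match(predicted_bits, word_sdrs) == expected:
            hits += 1
    return hits / float(len(test_set))

Because each CLA flavor only needs to supply `predict_sdr`, the same test
file would make the results comparable across implementations.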
>>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>>>
>>> Very cool, Francisco. Here is where you can get CEPT API credentials:
>>> https://cept.3scale.net/signup
>>>
>>> ---------
>>> Matt Taylor
>>> OS Community Flag-Bearer
>>> Numenta
>>>
>>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
>>>
>>> Just a short post scriptum:
>>>
>>> The public version of our API doesn't actually contain the generic
>>> conversion function. But if people from the HTM community want to
>>> experiment, just click the "Request for Beta-Program" button and I
>>> will upgrade your accounts manually.
>>>
>>> Francisco
>>>
>>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>>>
>>> > Jeff,
>>> > I thought about this already.
>>> > We have a REST API where you can send a word in and get the SDR
>>> > back, and vice versa. I invite all who want to experiment to try it
>>> > out. You just need to get credentials at our website: www.cept.at.
>>> >
>>> > In the mid-term it would be cool to create some sort of evaluation
>>> > set that could be used to measure progress while improving the CLA.
>>> >
>>> > We are continuously improving our Retina, but the version that is
>>> > currently online already works pretty well.
>>> >
>>> > I hope that helps.
>>> >
>>> > Francisco
>>> >
>>> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>>> >
>>> >> Francisco,
>>> >> Your work is very cool. Do you think it would be possible to make
>>> >> your word SDRs (or a sufficient subset of them) available for
>>> >> experimentation? I imagine there would be interest in the NuPIC
>>> >> community in training a CLA on text using your word SDRs. You might
>>> >> get some useful results more quickly. You could do this under a
>>> >> research-only license or something like that.
>>> >> Jeff
>>> >>
>>> >> -----Original Message-----
>>> >> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>> >> Sent: Wednesday, August 21, 2013 1:01 PM
>>> >> To: NuPIC general mailing list.
>>> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>> >>
>>> >> Hello,
>>> >> I am one of the founders of CEPT Systems and the lead researcher on
>>> >> our retina algorithm.
>>> >>
>>> >> We have developed a method to represent words by a bitmap pattern
>>> >> capturing most of their "lexical semantics" (a text sensor). Our
>>> >> word-SDRs fulfill all the requirements for "good" HTM input data:
>>> >>
>>> >> - Words with similar meaning "look" similar.
>>> >> - If you drop random bits from the representation, the semantics
>>> >>   remain intact.
>>> >> - Only a small number (up to 5%) of bits are set in a word-SDR.
>>> >> - Every bit in the representation corresponds to a specific semantic
>>> >>   feature of the language used.
>>> >> - The Retina (the sensory organ for an HTM) can be trained on any
>>> >>   language.
>>> >> - The Retina training process is fully unsupervised.
>>> >>
>>> >> We have found that the word-SDRs by themselves (without using any
>>> >> HTM yet) can improve many NLP problems that are only poorly solved
>>> >> by the traditional statistical approaches. We use the SDRs to:
>>> >>
>>> >> - Create fingerprints of text documents, which allows us to compare
>>> >>   them for semantic similarity using simple (Euclidean) similarity
>>> >>   measures (a sketch follows below this message).
>>> >> - Automatically detect polysemy and disambiguate multiple meanings.
>>> >> - Characterize any text with context terms for automatic
>>> >>   search-engine query expansion.
>>> >>
>>> >> We hope to successfully link up our Retina to an HTM network to go
>>> >> beyond lexical semantics into the field of "grammatical semantics".
>>> >> This would hopefully lead to improved abstracting, conversation,
>>> >> question-answering and translation systems.
>>> >>
>>> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>>> >>
>>> >> I am interested in any form of cooperation to apply HTM technology
>>> >> to text.
>>> >>
>>> >> Francisco
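The document-fingerprint comparison mentioned above might look roughly
like this (Python/numpy; the 16,384-bit retina width and the ~2% sparsity
cap are assumptions for illustration, not numbers given in this thread):

import numpy as np

N_BITS = 16384   # illustrative retina width, not a CEPT specification

def document_fingerprint(words, word_sdrs):
    """OR together the word-SDRs (sets of active bit indices), keeping
    only the most frequently hit bits so the fingerprint stays sparse."""
    counts = np.zeros(N_BITS, dtype=int)
    for w in words:
        for bit in word_sdrs.get(w, ()):
            counts[bit] += 1
    top = np.argsort(counts)[-int(0.02 * N_BITS):]   # ~2% sparsity, assumed
    fp = np.zeros(N_BITS, dtype=bool)
    fp[top[counts[top] > 0]] = True                  # ignore bits never hit
    return fp

def similarity(fp_a, fp_b):
    """Negative Euclidean distance on the binary vectors; for fingerprints
    with equal bit counts this ranks document pairs exactly like their
    bit overlap does."""
    return -np.linalg.norm(fp_a.astype(float) - fp_b.astype(float))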
>>> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>>> >>
>>> >>> Hello.
>>> >>>
>>> >>> Like many of you here, I am pretty new to HTM technology.
>>> >>>
>>> >>> I am a researcher in Brazil and I am going to start my PhD program
>>> >>> soon. My field of interest is NLP and the extraction of knowledge
>>> >>> from text. I am thinking of using the ideas behind the Memory
>>> >>> Prediction Framework to investigate semantic information retrieval
>>> >>> from the Web and answering questions in natural language. I intend
>>> >>> to use the HTM implementation as a base for this.
>>> >>>
>>> >>> I would appreciate it a lot if someone could answer some questions:
>>> >>>
>>> >>> - Is there any research related to HTM and NLP? Could you point me
>>> >>>   to it?
>>> >>> - Is HTM suited to this problem? Could it learn, without
>>> >>>   supervision, the grammar of a language, or just help with some
>>> >>>   aspects such as Named Entity Recognition?
>>> >>>
>>> >>> Regards,
>>> >>>
>>> >>> Christian

--
James Tauber
http://jtauber.com/
@jtauber on Twitter
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
