I've actually moved the texts to a full-blown GitHub repo: https://github.com/jtauber/nupic-texts
so feel free to log issues against it if other changes are necessary, and/or fork it and send pull requests if you want to change or add anything.

James

On Tue, Aug 27, 2013 at 1:54 PM, James Tauber <[email protected]> wrote:

> All done:
>
> https://gist.github.com/jtauber/6347309
>
> On Tue, Aug 27, 2013 at 12:35 PM, James Tauber <[email protected]> wrote:
>
>> yep, I'm working on it :-)
>>
>> On Tue, Aug 27, 2013 at 12:29 PM, Francisco Webber <[email protected]> wrote:
>>
>>> yes James that looks perfect.
>>> great job!
>>> Now we need the other tales in the same format.
>>>
>>> Francisco
>>>
>>> On 27.08.2013, at 15:14, James Tauber wrote:
>>>
>>> Let me know if this is what you had in mind (just the ugly duckling):
>>>
>>> https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
>>>
>>> I put each paragraph on its own line and separated the sections (that
>>> formerly were separated by a row of asterisks) with a blank line.
>>>
>>> James
>>>
>>> On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber <[email protected]> wrote:
>>>
>>>> James,
>>>> that's great!
>>>> I think that there are some more preparations necessary:
>>>> - All CRLF should be removed, keeping one blank after each full stop.
>>>> (This makes it easier for most parsers)
>>>> - The line of asterisks should be replaced by a CRLF to mark the
>>>> paragraphs. (We never know but we could need paragraph info at some time)
>>>> - The file as such should be split into single tales. (Whatever
>>>> experiments we run, if we rerun them with different tales, results become
>>>> more comparable)
>>>> - The title should not be written in caps.
>>>> (Capital letter + full stop is
>>>> interpreted as an acronym or middle name instead of a sentence delimiter)
>>>>
>>>> Francisco
>>>>
>>>> On 27.08.2013, at 00:22, James Tauber <[email protected]> wrote:
>>>>
>>>> I've removed the metadata, the vocab lists and the illustrations:
>>>>
>>>> https://gist.github.com/jtauber/6347309
>>>>
>>>> James
>>>>
>>>> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
>>>>
>>>>> I am sold on the kid’s story idea. I looked at the link below and
>>>>> there is a lot of metadata in this file. It would have to be removed
>>>>> before feeding to the CLA.
>>>>>
>>>>> My assumption is that we would need a CLA with more columns than the
>>>>> standard 2048. How many bits are in your word fingerprints? Could we
>>>>> make each bit a column and skip the SP?
>>>>>
>>>>> Jeff
>>>>>
>>>>> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>>>> Sent: Monday, August 26, 2013 3:50 AM
>>>>> To: NuPIC general mailing list.
>>>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>>
>>>>> Ian,
>>>>>
>>>>> I also thought about something from the Gutenberg repository.
>>>>>
>>>>> But I think we should start with something from the Kids Shelf.
>>>>>
>>>>> There are several reasons in my opinion:
>>>>>
>>>>> - We start experimentation with a full bag of unknown parameters, so
>>>>> keeping the test material simple would allow us to detect the important
>>>>> ones sooner. And it is quite some work to create a reliable evaluation
>>>>> framework, so the size of the data set makes a difference.
>>>>>
>>>>> - Keeping the text simple and short reduces substantially the overall
>>>>> vocabulary.
>>>>> If we want people to also evaluate offline, matching
>>>>> fingerprints can become a lengthy process without an efficient similarity
>>>>> engine.
>>>>>
>>>>> - Another reason is the fact that we don't know how much information a
>>>>> given set of columns (like the 2048 typically used) can absorb. In other
>>>>> words: what is the optimal ratio between a first layer of a text-HTM and
>>>>> the amount of text.
>>>>>
>>>>> - Lastly I believe that the sequence in which text is presented to the
>>>>> CLA is of importance. After all, when humans learn information by reading,
>>>>> they also start from simple to complex language. The amount of new
>>>>> vocabulary during training should be relatively stable (the actual amount
>>>>> would probably be linked to the ratio of my previous argument)
>>>>>
>>>>> So we should build continuously more complex training data sets,
>>>>> finally ending up with "true" books like the ones you listed.
>>>>>
>>>>> To start I would suggest something like:
>>>>>
>>>>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by
>>>>> Children
>>>>>
>>>>> http://www.gutenberg.org/ebooks/7841
>>>>>
>>>>> But there might still be better ones…
>>>>>
>>>>> Francisco
>>>>>
>>>>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>>>>>
>>>>> I will make 3 suggestions. All are out of copyright, well known,
>>>>> uncontroversial, and still taught in schools (at least in the US).
>>>>>
>>>>> 1. Robinson Crusoe - Daniel Defoe
>>>>>
>>>>> http://www.gutenberg.org/ebooks/521
>>>>>
>>>>> 2. Great Expectations - Charles Dickens
>>>>>
>>>>> http://www.gutenberg.org/ebooks/1400
>>>>>
>>>>> 3. The Time Machine - H.G. Wells
>>>>>
>>>>> http://www.gutenberg.org/ebooks/35
>>>>>
>>>>> Ian
>>>>>
>>>>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
>>>>>
>>>>> For those who don't want to use the API and for evaluation purposes, I
>>>>> would propose that we choose some reference text and I convert it into a
>>>>> sequence of SDRs. This file could be used for training.
>>>>>
>>>>> I would also generate a list of all words contained in the text,
>>>>> together with their SDRs, to be used as a conversion table.
>>>>>
>>>>> As a simple test measure we could feed a sequence of SDRs into a
>>>>> trained network and see if the HTM makes the right prediction about the
>>>>> following word(s).
>>>>>
>>>>> The last file to produce for a complete framework would be a list of,
>>>>> let's say, 100 word sequences with their correct continuation.
>>>>>
>>>>> The word sequences could be for example the beginnings of phrases with
>>>>> more than n words (n being the number of steps that the CLA can
>>>>> predict ahead)
>>>>>
>>>>> This could be the beginning of a measuring set-up that allows us to
>>>>> compare different CLA-implementation flavors.
>>>>>
>>>>> Any suggestions for a text to choose?
>>>>>
>>>>> Francisco
>>>>>
>>>>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>>>>>
>>>>> Very cool, Francisco.
>>>>> Here is where you can get CEPT API credentials:
>>>>> https://cept.3scale.net/signup
>>>>>
>>>>> ---------
>>>>> Matt Taylor
>>>>> OS Community Flag-Bearer
>>>>> Numenta
>>>>>
>>>>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
>>>>>
>>>>> Just a short post scriptum:
>>>>>
>>>>> The public version of our API doesn't actually contain the generic
>>>>> conversion function. But if people from the HTM community want to
>>>>> experiment just click the "Request for Beta-Program" button and I will
>>>>> upgrade your accounts manually.
>>>>>
>>>>> Francisco
>>>>>
>>>>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>>>>>
>>>>> > Jeff,
>>>>> > I thought about this already.
>>>>> > We have a REST API where you can send a word in and get the SDR
>>>>> > back, and vice versa.
>>>>> > I invite all who want to experiment to try it out.
>>>>> > You just need to get credentials at our website: www.cept.at.
>>>>> >
>>>>> > In the mid-term it would be cool to create some sort of evaluation set
>>>>> > that could be used to measure progress while improving the CLA.
>>>>> >
>>>>> > We are continuously improving our Retina but the version that is
>>>>> > currently online works pretty well already.
>>>>> >
>>>>> > I hope that will help
>>>>> >
>>>>> > Francisco
>>>>> >
>>>>> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>>>>> >
>>>>> >> Francisco,
>>>>> >> Your work is very cool. Do you think it would be possible to make
>>>>> >> your word SDRs (or a sufficient subset of them) available for
>>>>> >> experimentation? I imagine there would be interest in the NuPIC
>>>>> >> community in training a CLA on text using your word SDRs. You might
>>>>> >> get some useful results more quickly. You could do this under a
>>>>> >> research-only license or something like that.
>>>>> >> Jeff
>>>>> >>
>>>>> >> -----Original Message-----
>>>>> >> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>>>> >> Sent: Wednesday, August 21, 2013 1:01 PM
>>>>> >> To: NuPIC general mailing list.
>>>>> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>> >>
>>>>> >> Hello,
>>>>> >> I am one of the founders of CEPT Systems and lead researcher of our
>>>>> >> retina algorithm.
>>>>> >>
>>>>> >> We have developed a method to represent words by a bitmap pattern
>>>>> >> capturing most of their "lexical semantics". (A text sensor) Our
>>>>> >> word-SDRs fulfill all the requirements for "good" HTM input data.
>>>>> >>
>>>>> >> - Words with similar meaning "look" similar
>>>>> >> - If you drop random bits in the representation the semantics
>>>>> >> remain intact
>>>>> >> - Only a small number (up to 5%) of bits are set in a word-SDR
>>>>> >> - Every bit in the representation corresponds to a specific
>>>>> >> semantic feature of the language used
>>>>> >> - The Retina (sensory organ for an HTM) can be trained on any
>>>>> >> language
>>>>> >> - The retina training process is fully unsupervised.
>>>>> >>
>>>>> >> We have found out that the word-SDR by itself (without using any
>>>>> >> HTM yet) can improve many NLP problems that are only poorly solved
>>>>> >> using the traditional statistical approaches.
>>>>> >> We use the SDRs to:
>>>>> >> - Create fingerprints of text documents, which allows us to compare
>>>>> >> them for semantic similarity using simple (Euclidean) similarity
>>>>> >> measures
>>>>> >> - Automatically detect polysemy and disambiguate multiple
>>>>> >> meanings.
>>>>> >> - Characterize any text with context terms for automatic
>>>>> >> search-engine query expansion.
>>>>> >>
>>>>> >> We hope to successfully link up our Retina to an HTM network to go
>>>>> >> beyond lexical semantics into the field of "grammatical semantics".
>>>>> >> This would hopefully lead to improved abstracting, conversation,
>>>>> >> question-answering and translation systems.
>>>>> >>
>>>>> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>>>>> >>
>>>>> >> I am interested in any form of cooperation to apply HTM technology
>>>>> >> to text.
>>>>> >>
>>>>> >> Francisco
>>>>> >>
>>>>> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>>>>> >>
>>>>> >>> Hello.
>>>>> >>>
>>>>> >>> Like many of you here, I am pretty new to HTM technology.
>>>>> >>>
>>>>> >>> I am a researcher in Brazil and I am going to start my PhD program
>>>>> >>> soon. My field of interest is NLP and the extraction of knowledge
>>>>> >>> from text. I am thinking of using the ideas behind the Memory
>>>>> >>> Prediction Framework to investigate semantic information retrieval
>>>>> >>> from the Web, and answer questions in natural language. I intend to
>>>>> >>> use the HTM implementation as a base to do this.
>>>>> >>>
>>>>> >>> I would appreciate it a lot if someone could answer some questions:
>>>>> >>>
>>>>> >>> - Is there any research related to HTM and NLP? Could you point me
>>>>> >>> to it?
>>>>> >>>
>>>>> >>> - Is HTM suitable for this problem? Could it learn, without
>>>>> >>> supervision, the grammar of a language, or just help in some aspects
>>>>> >>> such as Named Entity Recognition?
>>>>> >>>
>>>>> >>> Regards,
>>>>> >>>
>>>>> >>> Christian
>>>>> >>>
>>>>> >>> _______________________________________________
>>>>> >>> nupic mailing list
>>>>> >>> [email protected]
>>>>> >>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>
>>>> --
>>>> James Tauber
>>>> http://jtauber.com/
>>>> @jtauber on Twitter

--
James Tauber
http://jtauber.com/
@jtauber on Twitter
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
