I plan to work on it tonight and will commit the Python scripts I write for them to my repo.
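
As a first cut, something along these lines for the statistics Francisco lists below (a rough, untested sketch; the texts/ layout and the constant names are placeholders):

# corpus_stats.py -- rough sketch, untested.
import glob
import os
import re

# Keep punctuation marks as separate "words", per the note below.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

paths = sorted(glob.glob("texts/*.txt"))  # placeholder: one tale per file
doc_tokens = {}
for path in paths:
    with open(path, encoding="utf-8") as f:
        doc_tokens[path] = tokenize(f.read())

collection_tokens = [t for ts in doc_tokens.values() for t in ts]
collection_vocab = set(collection_tokens)

# Wordlist of the collection (each occurring word has an entry).
with open("collection_wordlist.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(sorted(collection_vocab)) + "\n")

# Give each calculated item a speaking name...
stats = {
    "COLLECTION_SIZE_BYTES": sum(os.path.getsize(p) for p in paths),
    "COLLECTION_WORD_COUNT": len(collection_tokens),
    "COLLECTION_VOCAB_SIZE": len(collection_vocab),
}
for path in paths:
    name = os.path.splitext(os.path.basename(path))[0].upper()
    vocab = set(doc_tokens[path])
    stats[name + "_SIZE_BYTES"] = os.path.getsize(path)
    stats[name + "_WORD_COUNT"] = len(doc_tokens[path])
    stats[name + "_VOCAB_SIZE"] = len(vocab)
    # Document vocabulary as a percentage of the collection vocabulary.
    stats[name + "_VOCAB_COVERAGE_PCT"] = round(
        100.0 * len(vocab) / len(collection_vocab), 2)

# ...and generate an include file of constants for the evaluation code.
with open("corpus_constants.py", "w", encoding="utf-8") as out:
    for key in sorted(stats):
        out.write("{} = {!r}\n".format(key, stats[key]))
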
James

On Wed, Aug 28, 2013 at 2:16 PM, Matthew Taylor <[email protected]> wrote:

> If anyone starts to work on tasks in Francisco's list of statistical
> characteristics, please reply here so we don't duplicate any work.
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
> On Wed, Aug 28, 2013 at 10:14 AM, Francisco Webber <[email protected]> wrote:
>
>> James, that's great!
>>
>> A next step would be to calculate some statistical characteristics of the
>> collection. Typically:
>>
>> - Size in bytes of the collection
>> - Size in bytes of each document
>> - Word count of the collection (punctuation marks should count as words too)
>> - Word count of each document (ditto)
>> - Wordlist of the collection (each occurring word has an entry)
>> - Wordlist of each document (ditto)
>> - Coverage: each document's vocabulary as a percentage of the collection
>>   vocabulary (maybe also the vocabulary unique to each document)
>>
>> The last item will tell us whether coverage is evenly distributed over the
>> different documents. We might eliminate some documents from the list if
>> they don't match.
>>
>> In the end we could write a script that gives each calculated item a
>> speaking name, casts it as a constant, and generates an include file. That
>> makes it easy to create the evaluation code later.
>>
>> Francisco
>>
>> On 28.08.2013, at 18:47, James Tauber wrote:
>>
>> I've actually moved the texts to a full-blown GitHub repo:
>>
>> https://github.com/jtauber/nupic-texts
>>
>> so feel free to log issues against it if other changes are necessary,
>> and/or fork and send pull requests if you want to change or add anything.
>>
>> James
>>
>> On Tue, Aug 27, 2013 at 1:54 PM, James Tauber <[email protected]> wrote:
>>
>>> All done:
>>>
>>> https://gist.github.com/jtauber/6347309
>>>
>>> On Tue, Aug 27, 2013 at 12:35 PM, James Tauber <[email protected]> wrote:
>>>
>>>> Yep, I'm working on it :-)
>>>>
>>>> On Tue, Aug 27, 2013 at 12:29 PM, Francisco Webber <[email protected]> wrote:
>>>>
>>>>> Yes, James, that looks perfect.
>>>>> Great job!
>>>>> Now we need the other tales in the same format.
>>>>>
>>>>> Francisco
>>>>>
>>>>> On 27.08.2013, at 15:14, James Tauber wrote:
>>>>>
>>>>> Let me know if this is what you had in mind (just The Ugly Duckling):
>>>>>
>>>>> https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
>>>>>
>>>>> I put each paragraph on its own line and separated the sections (which
>>>>> were formerly separated by a row of asterisks) with a blank line.
>>>>>
>>>>> James
>>>>>
>>>>> On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber <[email protected]> wrote:
>>>>>
>>>>>> James,
>>>>>> that's great!
>>>>>> I think some more preparation is necessary:
>>>>>>
>>>>>> - All CRLFs should be removed, keeping one space after each full stop.
>>>>>>   (This makes it easier for most parsers.)
>>>>>> - Each line of asterisks should be replaced by a CRLF to mark the
>>>>>>   paragraphs. (We never know, but we could need the paragraph info at
>>>>>>   some point.)
>>>>>> - The file as such should be split into single tales. (Whatever
>>>>>>   experiments we run, rerunning them with different tales makes the
>>>>>>   results more comparable.)
>>>>>> - The titles should not be written in caps. (A capital letter followed
>>>>>>   by a full stop is interpreted as an acronym or a middle initial
>>>>>>   instead of a sentence delimiter.)
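>>>>>>
>>>>>> Roughly, these steps could look like this (an untested sketch; the
>>>>>> input filename is a placeholder, and it assumes one tale per file with
>>>>>> the title as the first paragraph):
>>>>>>
>>>>>> # prepare_tale.py -- sketch of the cleanup steps above (untested).
>>>>>> import re
>>>>>>
>>>>>> with open("the_ugly_duckling_raw.txt", encoding="utf-8") as f:
>>>>>>     raw = f.read()
>>>>>>
>>>>>> # A row of asterisks marked a section break in the Gutenberg source;
>>>>>> # treat it as a paragraph boundary.
>>>>>> sections = re.split(r"\n\s*\*[\s\*]*\n", raw)
>>>>>>
>>>>>> paragraphs = []
>>>>>> for section in sections:
>>>>>>     # Unwrap hard line breaks, leaving one space between sentences.
>>>>>>     text = re.sub(r"\s*\n\s*", " ", section).strip()
>>>>>>     if text:
>>>>>>         paragraphs.append(text)
>>>>>>
>>>>>> # Assume the first paragraph is the all-caps title; re-case it so
>>>>>> # "CAPITAL LETTER + full stop" isn't read as an acronym.
>>>>>> title = paragraphs[0].title()
>>>>>>
>>>>>> out_name = title.lower().replace(" ", "_") + ".txt"
>>>>>> with open(out_name, "w", encoding="utf-8") as out:
>>>>>>     out.write(title + "\n\n")
>>>>>>     out.write("\n".join(paragraphs[1:]) + "\n")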
>>>>>>
>>>>>> Francisco
>>>>>>
>>>>>> On 27.08.2013, at 00:22, James Tauber <[email protected]> wrote:
>>>>>>
>>>>>> I've removed the metadata, the vocab lists and the illustrations:
>>>>>>
>>>>>> https://gist.github.com/jtauber/6347309
>>>>>>
>>>>>> James
>>>>>>
>>>>>> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
>>>>>>
>>>>>>> I am sold on the kid's story idea. I looked at the link below and
>>>>>>> there is a lot of metadata in this file. It would have to be removed
>>>>>>> before feeding it to the CLA.
>>>>>>>
>>>>>>> My assumption is that we would need a CLA with more columns than the
>>>>>>> standard 2048. How many bits are in your word fingerprints? Could we
>>>>>>> make each bit a column and skip the SP?
>>>>>>>
>>>>>>> Jeff
>>>>>>>
>>>>>>> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>>>>>> Sent: Monday, August 26, 2013 3:50 AM
>>>>>>> To: NuPIC general mailing list.
>>>>>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>>>>
>>>>>>> Ian,
>>>>>>>
>>>>>>> I also thought about something from the Gutenberg repository, but I
>>>>>>> think we should start with something from the kids' shelf.
>>>>>>>
>>>>>>> There are several reasons, in my opinion:
>>>>>>>
>>>>>>> - We start experimentation with a full bag of unknown parameters, so
>>>>>>> keeping the test material simple would let us spot the important ones
>>>>>>> sooner. And it is quite some work to create a reliable evaluation
>>>>>>> framework, so the size of the data set makes a difference.
>>>>>>>
>>>>>>> - Keeping the text simple and short substantially reduces the overall
>>>>>>> vocabulary. If we want people to evaluate offline as well, matching
>>>>>>> fingerprints can become a lengthy process without an efficient
>>>>>>> similarity engine.
>>>>>>>
>>>>>>> - Another reason is that we don't know how much information a given
>>>>>>> set of columns (like the 2048 typically used) can absorb; in other
>>>>>>> words, what the optimal ratio is between the first layer of a text HTM
>>>>>>> and the amount of text.
>>>>>>>
>>>>>>> - Lastly, I believe the sequence in which text is presented to the CLA
>>>>>>> matters. After all, when humans learn information by reading, they
>>>>>>> also progress from simple to complex language.
>>>>>>> The amount of new vocabulary during training should be relatively
>>>>>>> stable (the actual amount would probably be linked to the ratio in my
>>>>>>> previous point).
>>>>>>>
>>>>>>> So we should build progressively more complex training data sets,
>>>>>>> finally ending up with "true" books like the ones you listed.
>>>>>>>
>>>>>>> To start, I would suggest something like:
>>>>>>>
>>>>>>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
>>>>>>> http://www.gutenberg.org/ebooks/7841
>>>>>>>
>>>>>>> But there might still be better ones…
>>>>>>>
>>>>>>> Francisco
>>>>>>>
>>>>>>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>>>>>>>
>>>>>>> I will make three suggestions. All are out of copyright, well known,
>>>>>>> uncontroversial, and still taught in schools (at least in the US).
>>>>>>>
>>>>>>> 1. Robinson Crusoe - Daniel Defoe
>>>>>>> http://www.gutenberg.org/ebooks/521
>>>>>>>
>>>>>>> 2. Great Expectations - Charles Dickens
>>>>>>> http://www.gutenberg.org/ebooks/1400
>>>>>>>
>>>>>>> 3. The Time Machine - H.G. Wells
>>>>>>> http://www.gutenberg.org/ebooks/35
>>>>>>>
>>>>>>> Ian
>>>>>>>
>>>>>>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
>>>>>>>
>>>>>>> For those who don't want to use the API, and for evaluation purposes,
>>>>>>> I would propose that we choose some reference text and I convert it
>>>>>>> into a sequence of SDRs. This file could be used for training.
>>>>>>>
>>>>>>> I would also generate a list of all words contained in the text,
>>>>>>> together with their SDRs, to be used as a conversion table.
>>>>>>>
>>>>>>> As a simple test measure, we could feed a sequence of SDRs into a
>>>>>>> trained network and see if the HTM makes the right prediction about
>>>>>>> the following word(s).
>>>>>>>
>>>>>>> The last file to produce for a complete framework would be a list of,
>>>>>>> let's say, 100 word sequences with their correct continuations. The
>>>>>>> word sequences could be, for example, the beginnings of phrases with
>>>>>>> more than n words (n being the number of steps that the CLA can
>>>>>>> predict ahead).
>>>>>>>
>>>>>>> This could be the beginning of a measuring set-up that allows us to
>>>>>>> compare different CLA implementation flavors.
>>>>>>>
>>>>>>> Any suggestions for a text to choose?
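>>>>>>>
>>>>>>> For illustration, the test harness could be as simple as this
>>>>>>> (an untested sketch; model.predict_next stands in for whatever
>>>>>>> interface a trained CLA ends up exposing, and the test-file
>>>>>>> format is invented):
>>>>>>>
>>>>>>> # evaluate.py -- sketch of the measuring set-up (untested).
>>>>>>> def evaluate(model, word_to_sdr, path="test_sequences.tsv"):
>>>>>>>     """Fraction of test sequences whose continuation is predicted.
>>>>>>>
>>>>>>>     Each line of the file: "phrase beginning<TAB>next word".
>>>>>>>     """
>>>>>>>     correct = total = 0
>>>>>>>     with open(path, encoding="utf-8") as f:
>>>>>>>         for line in f:
>>>>>>>             sequence, continuation = line.rstrip("\n").split("\t")
>>>>>>>             sdrs = [word_to_sdr[w] for w in sequence.split()]
>>>>>>>             predicted = model.predict_next(sdrs)  # assumed interface
>>>>>>>             correct += int(predicted == continuation)
>>>>>>>             total += 1
>>>>>>>     return correct / float(total)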
>>>>>>>
>>>>>>> Francisco
>>>>>>>
>>>>>>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>>>>>>>
>>>>>>> Very cool, Francisco. Here is where you can get cept API credentials:
>>>>>>> https://cept.3scale.net/signup
>>>>>>>
>>>>>>> ---------
>>>>>>> Matt Taylor
>>>>>>> OS Community Flag-Bearer
>>>>>>> Numenta
>>>>>>>
>>>>>>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
>>>>>>>
>>>>>>> Just a short post scriptum:
>>>>>>> The public version of our API doesn't actually contain the generic
>>>>>>> conversion function, but if people from the HTM community want to
>>>>>>> experiment, just click the "Request for Beta-Program" button and I
>>>>>>> will upgrade your accounts manually.
>>>>>>>
>>>>>>> Francisco
>>>>>>>
>>>>>>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>>>>>>>
>>>>>>> Jeff,
>>>>>>> I thought about this already. We have a REST API where you can send a
>>>>>>> word in and get the SDR back, and vice versa. I invite everyone who
>>>>>>> wants to experiment to try it out. You just need to get credentials at
>>>>>>> our website: www.cept.at.
>>>>>>>
>>>>>>> In the mid-term it would be cool to create some sort of evaluation set
>>>>>>> that could be used to measure progress while improving the CLA.
>>>>>>>
>>>>>>> We are continuously improving our Retina, but the version that is
>>>>>>> currently online already works pretty well.
>>>>>>>
>>>>>>> I hope that helps.
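>>>>>>>
>>>>>>> Usage would be along these lines (a hypothetical sketch: the
>>>>>>> endpoint path, parameter names, and response field here are
>>>>>>> placeholders, not the real interface; see the API docs once you
>>>>>>> have credentials):
>>>>>>>
>>>>>>> # word_sdr_client.py -- hypothetical client sketch (untested).
>>>>>>> import requests
>>>>>>>
>>>>>>> API_KEY = "your-key-here"  # placeholder
>>>>>>>
>>>>>>> def word_to_sdr(word):
>>>>>>>     resp = requests.get("https://api.cept.at/word2sdr",  # hypothetical
>>>>>>>                         params={"word": word, "api_key": API_KEY})
>>>>>>>     resp.raise_for_status()
>>>>>>>     return resp.json()["sdr"]  # hypothetical response field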
>>>>>>>
>>>>>>> Francisco
>>>>>>>
>>>>>>> On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>>>>>>>
>>>>>>> Francisco,
>>>>>>> Your work is very cool. Do you think it would be possible to make your
>>>>>>> word SDRs (or a sufficient subset of them) available for
>>>>>>> experimentation? I imagine there would be interest in the NuPIC
>>>>>>> community in training a CLA on text using your word SDRs. You might
>>>>>>> get some useful results more quickly. You could do this under a
>>>>>>> research-only license or something like that.
>>>>>>> Jeff
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>>>>>> Sent: Wednesday, August 21, 2013 1:01 PM
>>>>>>> To: NuPIC general mailing list.
>>>>>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>>>>
>>>>>>> Hello,
>>>>>>> I am one of the founders of CEPT Systems and the lead researcher on
>>>>>>> our Retina algorithm.
>>>>>>>
>>>>>>> We have developed a method to represent words by a bitmap pattern
>>>>>>> capturing most of their "lexical semantics" (a text sensor). Our
>>>>>>> word-SDRs fulfill all the requirements for "good" HTM input data:
>>>>>>>
>>>>>>> - Words with similar meanings "look" similar.
>>>>>>> - If you drop random bits from the representation, the semantics
>>>>>>> remain intact.
>>>>>>> - Only a small number (up to 5%) of bits are set in a word-SDR.
>>>>>>> - Every bit in the representation corresponds to a specific semantic
>>>>>>> feature of the language used.
>>>>>>> - The Retina (a sensory organ for an HTM) can be trained on any
>>>>>>> language.
>>>>>>> - The Retina training process is fully unsupervised.
>>>>>>>
>>>>>>> We have found that the word-SDRs by themselves (without using any HTM
>>>>>>> yet) can improve on many NLP problems that are only poorly solved by
>>>>>>> the traditional statistical approaches. We use the SDRs to:
>>>>>>>
>>>>>>> - Create fingerprints of text documents, which lets us compare them
>>>>>>> for semantic similarity using simple (Euclidean) similarity measures.
>>>>>>> - Automatically detect polysemy and disambiguate multiple meanings.
>>>>>>> - Characterize any text with context terms for automatic search-engine
>>>>>>> query expansion.
>>>>>>>
>>>>>>> We hope to link our Retina up to an HTM network to go beyond lexical
>>>>>>> semantics into the field of "grammatical semantics". This would
>>>>>>> hopefully lead to improved abstracting, conversation,
>>>>>>> question-answering, and translation systems.
>>>>>>>
>>>>>>> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>>>>>>>
>>>>>>> I am interested in any form of cooperation to apply HTM technology to
>>>>>>> text.
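>>>>>>>
>>>>>>> As a toy example of the fingerprint comparison mentioned above (a
>>>>>>> sketch, not production code; note that for 0/1 vectors the squared
>>>>>>> Euclidean distance equals the Hamming distance):
>>>>>>>
>>>>>>> # similarity.py -- comparing two binary fingerprints (untested).
>>>>>>> import numpy as np
>>>>>>>
>>>>>>> def euclidean(a, b):
>>>>>>>     return np.sqrt(np.sum((a - b) ** 2))
>>>>>>>
>>>>>>> def overlap(a, b):
>>>>>>>     # Number of shared active bits; higher means more similar.
>>>>>>>     return int(np.sum(a & b))
>>>>>>>
>>>>>>> # Vector size is arbitrary here.
>>>>>>> a = np.zeros(16384, dtype=int); a[[5, 80, 4000]] = 1
>>>>>>> b = np.zeros(16384, dtype=int); b[[5, 80, 9999]] = 1
>>>>>>> print(euclidean(a, b), overlap(a, b))  # -> 1.414..., 2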
>>>>>>>
>>>>>>> Francisco
>>>>>>>
>>>>>>> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>>>>>>>
>>>>>>> Hello.
>>>>>>>
>>>>>>> Like many of you here, I am pretty new to HTM technology.
>>>>>>>
>>>>>>> I am a researcher in Brazil and I am going to start my PhD program
>>>>>>> soon. My field of interest is NLP and the extraction of knowledge from
>>>>>>> text. I am thinking of using the ideas behind the Memory Prediction
>>>>>>> Framework to investigate semantic information retrieval from the Web
>>>>>>> and answering questions in natural language. I intend to use the HTM
>>>>>>> implementation as a base for this.
>>>>>>>
>>>>>>> I would appreciate it a lot if someone could answer some questions:
>>>>>>>
>>>>>>> - Is there any research related to HTM and NLP? Could you point me to
>>>>>>> it?
>>>>>>>
>>>>>>> - Is HTM suited to this problem? Could it learn, without supervision,
>>>>>>> the grammar of a language, or just help with some aspects such as
>>>>>>> named entity recognition?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Christian

--
James Tauber
http://jtauber.com/
@jtauber on Twitter
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
