If anyone starts to work on tasks in Francisco's list of statistical characteristics, please reply here so we don't have any duplication of work.
---------
Matt Taylor
OS Community Flag-Bearer
Numenta

On Wed, Aug 28, 2013 at 10:14 AM, Francisco Webber <[email protected]> wrote:

> James, that's great!
>
> A next step would be to calculate some statistical characteristics of
> the collection. Typically:
>
> - Size in bytes of the collection
> - Size in bytes of each document
> - Word count of the collection (punctuation marks should count as
>   words too)
> - Word count of each document (likewise)
> - Word list of the collection (each occurring word has an entry)
> - Word list of each document (likewise)
> - Coverage of the vocabulary of each document, as a percentage of the
>   collection vocabulary (maybe also the unique vocabulary of each
>   document)
>
> The last item will tell us whether the coverage is evenly distributed
> over the different documents. We might eliminate some documents from
> the list if they don't match.
>
> In the end we could write a script that gives each of the calculated
> items a speaking name, casts it as a constant, and generates an
> include file. This makes it easy to create the evaluation code later.
>
> Francisco
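If someone does pick these up, here is a minimal sketch of how the
numbers could be computed and dumped as a constants file, in Python.
The tales/*.txt layout, the token definition, and the constant-naming
scheme are assumptions of this sketch, not anything agreed in the
thread:

    import glob
    import os
    import re

    # Punctuation marks count as words too, per Francisco's list above.
    TOKEN_RE = re.compile(r"\w+|[^\w\s]")

    def doc_stats(path):
        text = open(path).read()
        tokens = TOKEN_RE.findall(text.lower())
        return os.path.getsize(path), len(tokens), set(tokens)

    files = sorted(glob.glob("tales/*.txt"))  # assumed: one tale per file
    per_doc = dict((path, doc_stats(path)) for path in files)

    collection_vocab = set()
    for _, _, vocab in per_doc.values():
        collection_vocab |= vocab

    # Generate the "include file" of speaking-name constants.
    with open("corpus_stats.py", "w") as out:
        out.write("COLLECTION_SIZE_BYTES = %d\n"
                  % sum(b for b, _, _ in per_doc.values()))
        out.write("COLLECTION_WORD_COUNT = %d\n"
                  % sum(w for _, w, _ in per_doc.values()))
        out.write("COLLECTION_VOCAB_SIZE = %d\n" % len(collection_vocab))
        for path in files:
            nbytes, nwords, vocab = per_doc[path]
            # Speaking name: "the_ugly_duckling.txt" -> "THE_UGLY_DUCKLING"
            name = re.sub(r"\W", "_",
                          os.path.splitext(os.path.basename(path))[0]).upper()
            coverage = 100.0 * len(vocab) / max(1, len(collection_vocab))
            out.write("%s_SIZE_BYTES = %d\n" % (name, nbytes))
            out.write("%s_WORD_COUNT = %d\n" % (name, nwords))
            out.write("%s_VOCAB_COVERAGE_PCT = %.2f\n" % (name, coverage))

The generated corpus_stats.py could then simply be imported by the
evaluation code.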
> On 28.08.2013, at 18:47, James Tauber wrote:
>
> I've actually moved the texts to a full-blown GitHub repo:
>
> https://github.com/jtauber/nupic-texts
>
> so feel free to log issues against it if other changes are necessary,
> and/or fork and do pull requests if you want to change/add anything.
>
> James
>
> --
> James Tauber
> http://jtauber.com/
> @jtauber on Twitter
>
> On Tue, Aug 27, 2013 at 1:54 PM, James Tauber <[email protected]> wrote:
>
>> All done:
>>
>> https://gist.github.com/jtauber/6347309
>>
>> On Tue, Aug 27, 2013 at 12:35 PM, James Tauber <[email protected]> wrote:
>>
>>> yep, I'm working on it :-)
>>>
>>> On Tue, Aug 27, 2013 at 12:29 PM, Francisco Webber <[email protected]> wrote:
>>>
>>>> yes James, that looks perfect.
>>>> Great job!
>>>> Now we need the other tales in the same format.
>>>>
>>>> Francisco
>>>>
>>>> On 27.08.2013, at 15:14, James Tauber wrote:
>>>>
>>>> Let me know if this is what you had in mind (just the ugly duckling):
>>>>
>>>> https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
>>>>
>>>> I put each paragraph on its own line and separated the sections
>>>> (which were formerly separated by a row of asterisks) with a blank
>>>> line.
>>>>
>>>> James
>>>>
>>>> On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber <[email protected]> wrote:
>>>>
>>>>> James,
>>>>> that's great!
>>>>> I think some more preparation is necessary:
>>>>> - All CRLFs should be removed, keeping one space after each full
>>>>> stop. (This makes it easier for most parsers.)
>>>>> - The line of asterisks should be replaced by a CRLF to mark the
>>>>> paragraphs. (We never know, but we might need paragraph info at
>>>>> some point.)
>>>>> - The file as such should be split into single tales. (Whatever
>>>>> experiments we run, if we rerun them with different tales, the
>>>>> results become more comparable.)
>>>>> - The title should not be written in caps. (A capital letter plus
>>>>> full stop is interpreted as an acronym or middle name instead of a
>>>>> sentence delimiter.)
>>>>>
>>>>> Francisco
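For whoever prepares the remaining tales, a minimal sketch of the
normalization Francisco describes above. It assumes the asterisk
separators sit on lines of their own; the <SECTION> placeholder is just
a convenience of this sketch:

    import re

    def normalize(raw):
        # Mark each row of asterisks (the old section separators).
        raw = re.sub(r"^\s*\*(?:\s*\*)*\s*$", "<SECTION>", raw,
                     flags=re.MULTILINE)
        # Put each paragraph on a single line: Gutenberg paragraphs are
        # hard-wrapped and separated by blank lines.
        paragraphs = [" ".join(p.split())
                      for p in re.split(r"\n\s*\n", raw) if p.strip()]
        # Former asterisk rows become blank lines between sections.
        return "\n".join(paragraphs).replace("<SECTION>", "")

    def fix_title(title):
        # "THE UGLY DUCKLING." -> "The Ugly Duckling." so that CAPS plus
        # full stop is not mistaken for an acronym by sentence splitters.
        return title.title()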
>>>>> On 27.08.2013, at 00:22, James Tauber <[email protected]> wrote:
>>>>>
>>>>> I've removed the metadata, the vocab lists and the illustrations:
>>>>>
>>>>> https://gist.github.com/jtauber/6347309
>>>>>
>>>>> James
>>>>>
>>>>> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
>>>>>
>>>>>> I am sold on the kid's story idea. I looked at the link below and
>>>>>> there is a lot of metadata in this file. It would have to be
>>>>>> removed before feeding it to the CLA.
>>>>>>
>>>>>> My assumption is that we would need a CLA with more columns than
>>>>>> the standard 2048. How many bits are in your word fingerprints?
>>>>>> Could we make each bit a column and skip the SP?
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>>>>> Sent: Monday, August 26, 2013 3:50 AM
>>>>>> To: NuPIC general mailing list.
>>>>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>>>
>>>>>> Ian,
>>>>>>
>>>>>> I also thought about something from the Gutenberg repository,
>>>>>> but I think we should start with something from the Kids Shelf.
>>>>>>
>>>>>> There are several reasons, in my opinion:
>>>>>>
>>>>>> - We start experimentation with a full bag of unknown parameters,
>>>>>> so keeping the test material simple would allow us to detect the
>>>>>> important ones sooner. And it is quite some work to create a
>>>>>> reliable evaluation framework, so the size of the data set makes
>>>>>> a difference.
>>>>>>
>>>>>> - Keeping the text simple and short substantially reduces the
>>>>>> overall vocabulary. If we want people to also evaluate offline,
>>>>>> matching fingerprints can become a lengthy process without an
>>>>>> efficient similarity engine.
>>>>>>
>>>>>> - Another reason is that we don't know how much information a
>>>>>> given set of columns (like the 2048 typically used) can absorb.
>>>>>> In other words: what is the optimal ratio between the first layer
>>>>>> of a text HTM and the amount of text?
>>>>>>
>>>>>> - Lastly, I believe that the sequence in which text is presented
>>>>>> to the CLA matters. After all, when humans learn information by
>>>>>> reading, they also progress from simple to complex language. The
>>>>>> amount of new vocabulary during training should be relatively
>>>>>> stable (the actual amount would probably be linked to the ratio
>>>>>> in my previous point).
>>>>>>
>>>>>> So we should build progressively more complex training data sets,
>>>>>> finally ending up with "true" books like the ones you listed.
>>>>>>
>>>>>> To start, I would suggest something like:
>>>>>>
>>>>>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold
>>>>>> by Children
>>>>>> http://www.gutenberg.org/ebooks/7841
>>>>>>
>>>>>> But there might still be better ones…
>>>>>>
>>>>>> Francisco
>>>>>>
>>>>>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>>>>>>
>>>>>> I will make three suggestions. All are out of copyright, well
>>>>>> known, uncontroversial, and still taught in schools (at least in
>>>>>> the US).
>>>>>>
>>>>>> 1. Robinson Crusoe - Daniel Defoe
>>>>>>    http://www.gutenberg.org/ebooks/521
>>>>>>
>>>>>> 2. Great Expectations - Charles Dickens
>>>>>>    http://www.gutenberg.org/ebooks/1400
>>>>>>
>>>>>> 3. The Time Machine - H.G. Wells
>>>>>>    http://www.gutenberg.org/ebooks/35
>>>>>>
>>>>>> Ian
>>>>>>
>>>>>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
>>>>>>
>>>>>> For those who don't want to use the API, and for evaluation
>>>>>> purposes, I would propose that we choose some reference text and
>>>>>> I convert it into a sequence of SDRs. This file could be used for
>>>>>> training.
>>>>>> I would also generate a list of all words contained in the text,
>>>>>> together with their SDRs, to be used as a conversion table.
>>>>>> As a simple test measure we could feed a sequence of SDRs into a
>>>>>> trained network and see whether the HTM makes the right
>>>>>> prediction about the following word(s).
>>>>>> The last file to produce for a complete framework would be a list
>>>>>> of, let's say, 100 word sequences with their correct
>>>>>> continuations. The word sequences could be, for example, the
>>>>>> beginnings of phrases with more than n words (n being the number
>>>>>> of steps ahead that the CLA can predict).
>>>>>> This could be the beginning of a measuring set-up that allows us
>>>>>> to compare different CLA-implementation flavors.
>>>>>>
>>>>>> Any suggestions for a text to choose?
>>>>>>
>>>>>> Francisco
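A sketch of what that measuring set-up could look like. The
model.predict() interface, the conversion-table format, and the
nearest-neighbour scoring are assumptions of this sketch, not an
existing NuPIC API:

    def overlap(a, b):
        # SDRs are represented here as sets of active-bit indices.
        return len(a & b)

    def evaluate(model, test_sequences, conversion_table):
        """model.predict(prefix) is assumed to return the predicted SDR
        of the next word; conversion_table maps each word to its SDR."""
        correct = 0
        for prefix_words, next_word in test_sequences:
            prefix = [conversion_table[w] for w in prefix_words]
            predicted = model.predict(prefix)
            # Pick the vocabulary word whose SDR overlaps the predicted
            # SDR the most (a crude nearest-neighbour decoding step).
            best = max(conversion_table,
                       key=lambda w: overlap(conversion_table[w], predicted))
            correct += (best == next_word)
        return float(correct) / len(test_sequences)

With the 100 test sequences Francisco mentions, the returned fraction
would be directly comparable across CLA-implementation flavors.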
>>>>>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>>>>>>
>>>>>> Very cool, Francisco. Here is where you can get cept API credentials:
>>>>>> https://cept.3scale.net/signup
>>>>>>
>>>>>> ---------
>>>>>> Matt Taylor
>>>>>> OS Community Flag-Bearer
>>>>>> Numenta
>>>>>>
>>>>>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
>>>>>>
>>>>>> Just a short post scriptum:
>>>>>> The public version of our API doesn't actually contain the
>>>>>> generic conversion function. But if people from the HTM community
>>>>>> want to experiment, just click the "Request for Beta-Program"
>>>>>> button and I will upgrade your accounts manually.
>>>>>>
>>>>>> Francisco
>>>>>>
>>>>>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>>>>>>
>>>>>> > Jeff,
>>>>>> > I thought about this already.
>>>>>> > We have a REST API where you can send a word in and get the SDR
>>>>>> > back, and vice versa.
>>>>>> > I invite all who want to experiment to try it out.
>>>>>> > You just need to get credentials at our website: www.cept.at.
>>>>>> >
>>>>>> > In the mid-term it would be cool to create some sort of
>>>>>> > evaluation set that could be used to measure progress while
>>>>>> > improving the CLA.
>>>>>> >
>>>>>> > We are continuously improving our Retina, but the version that
>>>>>> > is currently online already works pretty well.
>>>>>> >
>>>>>> > I hope that helps.
>>>>>> >
>>>>>> > Francisco
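For anyone who gets Beta access: the thread doesn't show the actual
endpoint, so the URL, path, and response shape below are purely
hypothetical placeholders; only the word-in, SDR-out behaviour is taken
from Francisco's description. A sketch using the Python requests
library:

    import requests

    API_KEY = "your-cept-api-key"     # from the signup link above
    BASE_URL = "https://api.cept.at"  # hypothetical; see the CEPT docs

    def word_to_sdr(word):
        # Hypothetical path and response format: assumes JSON carrying
        # the active-bit positions under a "positions" key.
        resp = requests.get("%s/word2sdr" % BASE_URL,
                            params={"word": word, "api_key": API_KEY})
        resp.raise_for_status()
        return set(resp.json()["positions"])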
>>>>>> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>>>>>> >
>>>>>> >> Francisco,
>>>>>> >> Your work is very cool. Do you think it would be possible to
>>>>>> >> make your word SDRs (or a sufficient subset of them) available
>>>>>> >> for experimentation? I imagine there would be interest in the
>>>>>> >> NuPIC community in training a CLA on text using your word
>>>>>> >> SDRs. You might get some useful results more quickly. You
>>>>>> >> could do this under a research-only license or something like
>>>>>> >> that.
>>>>>> >> Jeff
>>>>>> >>
>>>>>> >> -----Original Message-----
>>>>>> >> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>>>>> >> Sent: Wednesday, August 21, 2013 1:01 PM
>>>>>> >> To: NuPIC general mailing list.
>>>>>> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>>> >>
>>>>>> >> Hello,
>>>>>> >> I am one of the founders of CEPT Systems and the lead
>>>>>> >> researcher on our Retina algorithm.
>>>>>> >>
>>>>>> >> We have developed a method to represent words by a bitmap
>>>>>> >> pattern that captures most of their "lexical semantics" (a
>>>>>> >> text sensor). Our word-SDRs fulfill all the requirements for
>>>>>> >> "good" HTM input data:
>>>>>> >>
>>>>>> >> - Words with similar meaning "look" similar.
>>>>>> >> - If you drop random bits from the representation, the
>>>>>> >> semantics remain intact.
>>>>>> >> - Only a small number (up to 5%) of the bits are set in a
>>>>>> >> word-SDR.
>>>>>> >> - Every bit in the representation corresponds to a specific
>>>>>> >> semantic feature of the language used.
>>>>>> >> - The Retina (a sensory organ for an HTM) can be trained on
>>>>>> >> any language.
>>>>>> >> - The Retina training process is fully unsupervised.
>>>>>> >>
>>>>>> >> We have found that the word-SDRs by themselves (without using
>>>>>> >> any HTM yet) can improve on many NLP problems that are only
>>>>>> >> poorly solved by the traditional statistical approaches.
>>>>>> >> We use the SDRs to:
>>>>>> >> - Create fingerprints of text documents, which allows us to
>>>>>> >> compare them for semantic similarity using simple (Euclidean)
>>>>>> >> similarity measures.
>>>>>> >> - Automatically detect polysemy and disambiguate multiple
>>>>>> >> meanings.
>>>>>> >> - Characterize any text with context terms for automatic
>>>>>> >> search-engine query expansion.
>>>>>> >>
>>>>>> >> We hope to successfully link up our Retina to an HTM network
>>>>>> >> to go beyond lexical semantics into the field of "grammatical
>>>>>> >> semantics". This would hopefully lead to improved abstracting,
>>>>>> >> conversation, question-answering, and translation systems.
>>>>>> >>
>>>>>> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>>>>>> >>
>>>>>> >> I am interested in any form of cooperation to apply HTM
>>>>>> >> technology to text.
>>>>>> >>
>>>>>> >> Francisco
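To make the fingerprint comparison above concrete, a minimal sketch
under assumptions of my own: a 16384-bit retina, roughly 5% sparsity,
and a naive most-frequent-bits aggregation (CEPT's actual method is not
described in the thread). On binary vectors, Euclidean distance reduces
to a set operation on the active-bit positions:

    import math

    def document_fingerprint(word_sdrs, n_bits=16384, sparsity=0.05):
        # Count how often each bit position is active across the
        # document's word SDRs, then keep the top ~5% of positions.
        counts = {}
        for sdr in word_sdrs:
            for bit in sdr:
                counts[bit] = counts.get(bit, 0) + 1
        keep = max(1, int(sparsity * n_bits))
        return set(sorted(counts, key=counts.get, reverse=True)[:keep])

    def distance(fp_a, fp_b):
        # For 0/1 vectors, squared Euclidean distance equals the size
        # of the symmetric difference of the two active-bit sets.
        return math.sqrt(len(fp_a ^ fp_b))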
>>>>>> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>>>>>> >>
>>>>>> >>> Hello.
>>>>>> >>>
>>>>>> >>> Like many of you here, I am pretty new to HTM technology.
>>>>>> >>>
>>>>>> >>> I am a researcher in Brazil and I am going to start my PhD
>>>>>> >>> program soon. My field of interest is NLP and the extraction
>>>>>> >>> of knowledge from text. I am thinking of using the ideas
>>>>>> >>> behind the Memory Prediction Framework to investigate
>>>>>> >>> semantic information retrieval from the Web and answering
>>>>>> >>> questions in natural language. I intend to use the HTM
>>>>>> >>> implementation as a base for this.
>>>>>> >>>
>>>>>> >>> I would appreciate it a lot if someone could answer some
>>>>>> >>> questions:
>>>>>> >>>
>>>>>> >>> - Is there any research related to HTM and NLP? Could you
>>>>>> >>> point me to it?
>>>>>> >>> - Is HTM suited to this problem? Could it learn, without
>>>>>> >>> supervision, the grammar of a language, or just help in some
>>>>>> >>> aspects such as Named Entity Recognition?
>>>>>> >>>
>>>>>> >>> Regards,
>>>>>> >>>
>>>>>> >>> Christian

_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
