Ok, I will work on the specs further.

Francisco
On 28.08.2013, at 20:21, James Tauber wrote:

> I plan to work on it tonight and will commit the python scripts I write
> for them to my repo.
>
> James
>
> On Wed, Aug 28, 2013 at 2:16 PM, Matthew Taylor <[email protected]> wrote:
> If anyone starts to work on tasks in Francisco's list of statistical
> characteristics, please reply here so we don't have any duplication of work.
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
> On Wed, Aug 28, 2013 at 10:14 AM, Francisco Webber <[email protected]> wrote:
> James, that's great!
>
> A next step would be to calculate some statistical characteristics of the
> collection. Typically:
>
> - Size in bytes of the collection
> - Size in bytes of each document
> - Word count of the collection (punctuation marks should count as words too)
> - Word count of each document (idem)
> - Wordlist of the collection (one entry per occurring word)
> - Wordlist of each document (idem)
> - Coverage of each document's vocabulary, in percent of the collection
>   vocabulary (maybe also the vocabulary unique to each document)
>
> The last item will tell us whether the coverage is evenly distributed over
> the documents. We might eliminate some documents from the list if they
> don't match.
>
> In the end we could write a script that gives each calculated item a
> descriptive name, casts it as a constant and generates an include file.
> That makes it easy to create the evaluation code later.
>
> Francisco
>
> On 28.08.2013, at 18:47, James Tauber wrote:
>
>> I've actually moved the texts to a full-blown GitHub repo:
>>
>> https://github.com/jtauber/nupic-texts
>>
>> so feel free to log issues against it if other changes are necessary and/or
>> fork and do pull requests if you want to change/add anything.
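The statistics Francisco lists could be sketched roughly as follows (the tokenizer, file handling and constant-naming scheme are assumptions, not an agreed spec):

```python
import os
import re
from collections import Counter

# Punctuation marks count as words too, per Francisco's note.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def document_stats(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    tokens = TOKEN_RE.findall(text.lower())
    return {
        "bytes": os.path.getsize(path),
        "word_count": len(tokens),
        "wordlist": Counter(tokens),  # one entry per occurring word
    }

def collection_stats(paths):
    docs = {p: document_stats(p) for p in paths}
    collection_vocab = set().union(*(d["wordlist"] for d in docs.values()))
    for p, d in docs.items():
        vocab = set(d["wordlist"])
        other_vocab = set().union(
            *(set(o["wordlist"]) for q, o in docs.items() if q != p)
        )
        # Coverage of the collection vocabulary, in percent.
        d["coverage_pct"] = 100.0 * len(vocab) / len(collection_vocab)
        # Words that occur in no other document.
        d["unique_vocab"] = vocab - other_vocab
    return docs, collection_vocab

def write_constants(docs, collection_vocab, out_path):
    # Give each calculated item a descriptive name, cast it as a constant,
    # and generate an include file for the later evaluation code.
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(f"COLLECTION_VOCAB_SIZE = {len(collection_vocab)}\n")
        for p, d in sorted(docs.items()):
            stem = os.path.splitext(os.path.basename(p))[0]
            name = re.sub(r"\W+", "_", stem).upper()
            out.write(f"{name}_WORD_COUNT = {d['word_count']}\n")
```

A document whose coverage is far below the others would then stand out directly in the generated constants.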
>>
>> James
>>
>> On Tue, Aug 27, 2013 at 1:54 PM, James Tauber <[email protected]> wrote:
>> All done:
>>
>> https://gist.github.com/jtauber/6347309
>>
>> On Tue, Aug 27, 2013 at 12:35 PM, James Tauber <[email protected]> wrote:
>> yep, I'm working on it :-)
>>
>> On Tue, Aug 27, 2013 at 12:29 PM, Francisco Webber <[email protected]> wrote:
>> yes James, that looks perfect.
>> great job!
>> Now we need the other tales in the same format.
>>
>> Francisco
>>
>> On 27.08.2013, at 15:14, James Tauber wrote:
>>
>>> Let me know if this is what you had in mind (just the ugly duckling):
>>>
>>> https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
>>>
>>> I put each paragraph on its own line and separated the sections (which
>>> were formerly separated by a row of asterisks) with a blank line.
>>>
>>> James
>>>
>>> On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber
>>> <[email protected]> wrote:
>>> James,
>>> that's great!
>>> I think some more preparations are necessary:
>>> - All CRLFs should be removed, keeping one blank space after each full
>>>   stop. (This makes it easier for most parsers.)
>>> - The line of asterisks should be replaced by a CRLF to mark the
>>>   paragraphs. (We never know, but we might need paragraph info at some
>>>   point.)
>>> - The file as such should be split into single tales. (Whatever
>>>   experiments we run, if we rerun them with different tales, results
>>>   become more comparable.)
>>> - The title should not be written in caps. (A capital letter followed by
>>>   a full stop is interpreted as an acronym or middle name instead of a
>>>   sentence delimiter.)
>>>
>>> Francisco
>>>
>>> On 27.08.2013, at 00:22, James Tauber <[email protected]> wrote:
>>>
>>>> I've removed the metadata, the vocab lists and the illustrations:
>>>>
>>>> https://gist.github.com/jtauber/6347309
>>>>
>>>> James
>>>>
>>>> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
>>>> I am sold on the kid's story idea.
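Francisco's preparation steps could be sketched like this (a rough sketch; the exact section handling and title fix are guesses at the intent):

```python
import re

def clean_tale(raw):
    # Sections were separated by a row of asterisks, paragraphs by hard
    # line breaks.  Put each paragraph on one line (removing the CRLFs,
    # keeping a single space after each full stop) and separate sections
    # with a blank line, matching the format of James's gist.
    sections = re.split(r"^\s*\*[\s*]*$", raw, flags=re.MULTILINE)
    cleaned = []
    for section in sections:
        paragraphs = [
            " ".join(p.split())
            for p in re.split(r"\n\s*\n", section)
            if p.strip()
        ]
        cleaned.append("\n".join(paragraphs))
    return "\n\n".join(s for s in cleaned if s)

def fix_title(title):
    # "THE UGLY DUCKLING" -> "The Ugly Duckling", so that the capitals
    # are not misread as acronyms by sentence splitters.
    return title.title() if title.isupper() else title
```

Splitting the collection into single tales would then just be a matter of writing each cleaned section, or each Gutenberg story, to its own file.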
>>>> I looked at the link below and there is a lot of metadata in this
>>>> file. It would have to be removed before feeding it to the CLA.
>>>>
>>>> My assumption is that we would need a CLA with more columns than the
>>>> standard 2048. How many bits are in your word fingerprints? Could we
>>>> make each bit a column and skip the SP?
>>>>
>>>> Jeff
>>>>
>>>> From: nupic [mailto:[email protected]] On Behalf Of
>>>> Francisco Webber
>>>> Sent: Monday, August 26, 2013 3:50 AM
>>>> To: NuPIC general mailing list
>>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>
>>>> Ian,
>>>> I also thought about something from the Gutenberg repository.
>>>> But I think we should start with something from the Kids Shelf.
>>>>
>>>> There are several reasons, in my opinion:
>>>>
>>>> - We start experimentation with a full bag of unknown parameters, so
>>>> keeping the test material simple would let us detect the important ones
>>>> sooner. And it is quite some work to create a reliable evaluation
>>>> framework, so the size of the data set makes a difference.
>>>>
>>>> - Keeping the text simple and short substantially reduces the overall
>>>> vocabulary. If we want people to also evaluate offline, matching
>>>> fingerprints can become a lengthy process without an efficient
>>>> similarity engine.
>>>>
>>>> - Another reason is that we don't know how much information a given set
>>>> of columns (like the 2048 typically used) can absorb. In other words:
>>>> what is the optimal ratio between a first layer of a text-HTM and the
>>>> amount of text?
>>>>
>>>> - Lastly, I believe that the sequence in which text is presented to the
>>>> CLA matters. After all, when humans learn information by reading, they
>>>> also progress from simple to complex language.
>>>> The amount of new vocabulary during training should be relatively
>>>> stable (the actual amount would probably be linked to the ratio in my
>>>> previous point).
>>>>
>>>> So we should build progressively more complex training data sets,
>>>> finally ending up with "true" books like the ones you listed.
>>>>
>>>> To start, I would suggest something like:
>>>>
>>>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by
>>>> Children
>>>> http://www.gutenberg.org/ebooks/7841
>>>>
>>>> But there might still be better ones…
>>>>
>>>> Francisco
>>>>
>>>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>>>>
>>>> I will make 3 suggestions. All are out of copyright, well known,
>>>> uncontroversial, and still taught in schools (at least in the US):
>>>>
>>>> 1. Robinson Crusoe - Daniel Defoe
>>>>    http://www.gutenberg.org/ebooks/521
>>>>
>>>> 2. Great Expectations - Charles Dickens
>>>>    http://www.gutenberg.org/ebooks/1400
>>>>
>>>> 3. The Time Machine - H.G. Wells
>>>>    http://www.gutenberg.org/ebooks/35
>>>>
>>>> Ian
>>>>
>>>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]>
>>>> wrote:
>>>>
>>>> For those who don't want to use the API, and for evaluation purposes, I
>>>> would propose that we choose some reference text and I convert it into
>>>> a sequence of SDRs. This file could be used for training.
>>>>
>>>> I would also generate a list of all words contained in the text,
>>>> together with their SDRs, to be used as a conversion table.
>>>>
>>>> As a simple test measure we could feed a sequence of SDRs into a
>>>> trained network and see if the HTM makes the right prediction about the
>>>> following word(s).
>>>> The last file to produce for a complete framework would be a list of,
>>>> let's say, 100 word sequences with their correct continuation.
>>>>
>>>> The word sequences could be, for example, the beginnings of phrases
>>>> with more than n words (n being the number of steps that the CLA can
>>>> predict ahead).
>>>>
>>>> This could be the beginning of a measuring set-up that allows us to
>>>> compare different CLA-implementation flavors.
>>>>
>>>> Any suggestions for a text to choose?
>>>>
>>>> Francisco
>>>>
>>>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>>>>
>>>> Very cool, Francisco. Here is where you can get cept API credentials:
>>>> https://cept.3scale.net/signup
>>>>
>>>> ---------
>>>> Matt Taylor
>>>> OS Community Flag-Bearer
>>>> Numenta
>>>>
>>>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
>>>>
>>>> Just a short post scriptum:
>>>> The public version of our API doesn't actually contain the generic
>>>> conversion function. But if people from the HTM community want to
>>>> experiment, just click the "Request for Beta-Program" button and I will
>>>> upgrade your accounts manually.
>>>>
>>>> Francisco
>>>>
>>>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>>>>
>>>> > Jeff,
>>>> > I thought about this already.
>>>> > We have a REST API where you can send a word in and get the SDR back,
>>>> > and vice versa.
>>>> > I invite all who want to experiment to try it out.
>>>> > You just need to get credentials at our website: www.cept.at.
>>>> >
>>>> > In the mid-term it would be cool to create some sort of evaluation
>>>> > set that could be used to measure progress while improving the CLA.
>>>> >
>>>> > We are continuously improving our Retina, but the version that is
>>>> > currently online works pretty well already.
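The measuring set-up Francisco outlines might look like this (the `reset`/`predict` interface is hypothetical, standing in for whatever a trained CLA actually exposes, and the conversion table maps words to their SDRs):

```python
# Sketch of the evaluation loop: feed each test sequence (the beginning
# of a phrase) into a trained model and check whether it predicts the
# correct continuation.  `model` is a hypothetical predictor interface,
# not an actual NuPIC call; here `predict` is assumed to return the
# predicted next word, already looked up back through the table.

def evaluate(model, test_cases, sdr_table):
    """test_cases: list of (word_sequence, correct_next_word) pairs."""
    hits = 0
    for sequence, expected in test_cases:
        model.reset()                    # start a fresh sequence
        for word in sequence:
            predicted = model.predict(sdr_table[word])
        # Score only the prediction made after the last word was fed in.
        if predicted == expected:
            hits += 1
    return hits / len(test_cases)
```

The resulting hit rate would be the single number used to compare different CLA-implementation flavors on the same 100-sequence test file.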
>>>> >
>>>> > I hope that helps
>>>> >
>>>> > Francisco
>>>> >
>>>> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>>>> >
>>>> >> Francisco,
>>>> >> Your work is very cool. Do you think it would be possible to make
>>>> >> available your word SDRs (or a sufficient subset of them) for
>>>> >> experimentation? I imagine there would be interest in the NuPIC
>>>> >> community in training a CLA on text using your word SDRs. You might
>>>> >> get some useful results more quickly. You could do this under a
>>>> >> research-only license or something like that.
>>>> >> Jeff
>>>> >>
>>>> >> -----Original Message-----
>>>> >> From: nupic [mailto:[email protected]] On Behalf Of
>>>> >> Francisco Webber
>>>> >> Sent: Wednesday, August 21, 2013 1:01 PM
>>>> >> To: NuPIC general mailing list
>>>> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>> >>
>>>> >> Hello,
>>>> >> I am one of the founders of CEPT Systems and lead researcher of our
>>>> >> retina algorithm.
>>>> >>
>>>> >> We have developed a method to represent words by a bitmap pattern
>>>> >> capturing most of their "lexical semantics" (a text sensor). Our
>>>> >> word-SDRs fulfill all the requirements for "good" HTM input data:
>>>> >>
>>>> >> - Words with similar meaning "look" similar
>>>> >> - If you drop random bits in the representation, the semantics
>>>> >> remain intact
>>>> >> - Only a small number (up to 5%) of bits are set in a word-SDR
>>>> >> - Every bit in the representation corresponds to a specific semantic
>>>> >> feature of the language used
>>>> >> - The Retina (sensory organ for an HTM) can be trained on any
>>>> >> language
>>>> >> - The Retina training process is fully unsupervised
>>>> >>
>>>> >> We have found that the word-SDR by itself (without using any HTM
>>>> >> yet) can improve many NLP problems that are only poorly solved using
>>>> >> traditional statistical approaches.
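The word-SDR properties listed above can be illustrated with a toy sketch (the sizes and the random "encoding" are invented for illustration; the real Retina assigns bits by semantic feature, not at random):

```python
import random

SDR_SIZE = 16384      # e.g. a 128 x 128 retina grid (assumed size)
ACTIVE_BITS = 328     # about 2% of bits set, within the "up to 5%" bound

def random_word_sdr(rng):
    # Stand-in for a Retina encoding: a random sparse set of bit indices.
    return frozenset(rng.sample(range(SDR_SIZE), ACTIVE_BITS))

def overlap(a, b):
    # Shared set bits: the basic similarity measure for binary SDRs.
    return len(a & b)

def drop_bits(sdr, fraction, rng):
    # Randomly drop a fraction of the set bits.  Because each remaining
    # bit still marks a semantic feature, the word stays recognizable.
    keep = rng.sample(sorted(sdr), round(len(sdr) * (1 - fraction)))
    return frozenset(keep)
```

Even after dropping 30% of the bits, a degraded SDR overlaps its original far more than two unrelated words overlap by chance (the expected chance overlap here is about 328² / 16384 ≈ 7 bits), which is what makes the representation robust.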
>>>> >> We use the SDRs to:
>>>> >> - Create fingerprints of text documents, which allows us to compare
>>>> >> them for semantic similarity using simple (Euclidean) similarity
>>>> >> measures
>>>> >> - Automatically detect polysemy and disambiguate multiple meanings
>>>> >> - Characterize any text with context terms for automatic
>>>> >> search-engine query expansion
>>>> >>
>>>> >> We hope to successfully link up our Retina to an HTM network to go
>>>> >> beyond lexical semantics into the field of "grammatical semantics".
>>>> >> This would hopefully lead to improved abstracting, conversation,
>>>> >> question-answering and translation systems.
>>>> >>
>>>> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>>>> >>
>>>> >> I am interested in any form of cooperation to apply HTM technology
>>>> >> to text.
>>>> >>
>>>> >> Francisco
>>>> >>
>>>> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>>>> >>
>>>> >>> Hello.
>>>> >>>
>>>> >>> Like many of you here, I am pretty new to HTM technology.
>>>> >>>
>>>> >>> I am a researcher in Brazil and I am going to start my PhD program
>>>> >>> soon. My field of interest is NLP and the extraction of knowledge
>>>> >>> from text. I am thinking of using the ideas behind the Memory
>>>> >>> Prediction Framework to investigate semantic information retrieval
>>>> >>> from the Web and answering questions in natural language. I intend
>>>> >>> to use the HTM implementation as a base for this.
>>>> >>>
>>>> >>> I would appreciate it a lot if someone could answer some questions:
>>>> >>>
>>>> >>> - Is there any research related to HTM and NLP? Could you point me
>>>> >>> to it?
>>>> >>>
>>>> >>> - Is HTM suited to this problem? Could it learn, without
>>>> >>> supervision, the grammar of a language, or just help with some
>>>> >>> aspects such as Named Entity Recognition?
>>>> >>>
>>>> >>> Regards,
>>>> >>>
>>>> >>> Christian
>>>> >>>
>>>> >>> _______________________________________________
>>>> >>> nupic mailing list
>>>> >>> [email protected]
>>>> >>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>
>>>> --
>>>> James Tauber
>>>> http://jtauber.com/
>>>> @jtauber on Twitter
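The document-fingerprint comparison from Francisco's CEPT mail above could be sketched like this (the aggregation rule, summing word SDRs and keeping the most frequent bits, is an assumption for illustration; CEPT's actual method is not described in the thread):

```python
from collections import Counter

def document_fingerprint(word_sdrs, active_bits=10):
    # Sum the word SDRs and keep only the most frequently active bits,
    # so the document fingerprint is itself a sparse set of bits.
    counts = Counter()
    for sdr in word_sdrs:
        counts.update(sdr)
    return frozenset(bit for bit, _ in counts.most_common(active_bits))

def similarity(fp_a, fp_b):
    # Euclidean distance between the binary vectors, folded into a
    # similarity score in (0, 1].  For binary bit sets the squared
    # distance is just the size of the symmetric difference.
    distance = len(fp_a ^ fp_b) ** 0.5
    return 1.0 / (1.0 + distance)
```

Two documents that share their dominant vocabulary end up with near-identical fingerprints, while documents with disjoint vocabulary score much lower, which is the basis of the semantic-similarity comparison described in the mail.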
