I've actually moved the texts to a full-blown GitHub repo:

https://github.com/jtauber/nupic-texts

so feel free to log issues against it if other changes are necessary and/or
fork and do pull requests if you want to change/add anything.

James


On Tue, Aug 27, 2013 at 1:54 PM, James Tauber <[email protected]> wrote:

> All done:
>
> https://gist.github.com/jtauber/6347309
>
>
>
>
> On Tue, Aug 27, 2013 at 12:35 PM, James Tauber <[email protected]> wrote:
>
>> yep, I'm working on it :-)
>>
>>
>> On Tue, Aug 27, 2013 at 12:29 PM, Francisco Webber <[email protected]> wrote:
>>
>>> Yes, James, that looks perfect.
>>> Great job!
>>> Now we need the other tales in the same format.
>>>
>>> Francisco
>>>
>>> On 27.08.2013, at 15:14, James Tauber wrote:
>>>
>>> Let me know if this is what you had in mind (just the ugly duckling):
>>>
>>> https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
>>>
>>> I put each paragraph on its own line and separated the sections (that
>>> formerly were separated by a row of asterisks) with a blank line.
>>>
>>> James
>>>
>>>
>>> On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber <
>>> [email protected]> wrote:
>>>
>>>> James,
>>>> that's great!
>>>> I think that there are some more preparations necessary:
>>>> - All CRLFs should be removed, keeping one blank after each full stop.
>>>> (This makes it easier for most parsers.)
>>>> - The line of asterisks should be replaced by a CRLF to mark the
>>>> paragraphs. (We never know, but we could need paragraph info at some time.)
>>>> - The file as such should be split into single tales. (Whatever
>>>> experiments we run, if we rerun them with different tales, results become
>>>> more comparable.)
>>>> - The title should not be written in caps. (A capital letter + full stop
>>>> is interpreted as an acronym or middle name instead of a sentence
>>>> delimiter.)
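The preparation steps above could be sketched roughly like this (a hypothetical script, not something from the thread; the asterisk-row and one-paragraph-per-line conventions are taken from the discussion):

```python
import re

def clean_tale(raw: str) -> str:
    """Normalize a plain-text tale: one paragraph per line, sections
    (formerly marked by a row of asterisks) separated by a blank line,
    and all-caps titles folded to title case."""
    # A row of asterisks marks a section break; split the text there.
    sections = re.split(r"^\s*\*[\s*]*$", raw, flags=re.MULTILINE)
    cleaned = []
    for section in sections:
        lines = []
        # Paragraphs are separated by blank lines in the raw text.
        for block in re.split(r"\n\s*\n", section):
            words = block.split()
            if not words:
                continue
            line = " ".join(words)   # unwrap hard line breaks (CRLFs)
            if line.isupper():       # avoid "CAPS + full stop" parse issues
                line = line.title()
            lines.append(line)
        if lines:
            cleaned.append("\n".join(lines))
    return "\n\n".join(cleaned)
```

Splitting the cleaned output on the blank lines would then give the single-tale files.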
>>>>
>>>> Francisco
>>>>
>>>>
>>>> On 27.08.2013 at 00:22, James Tauber <[email protected]> wrote:
>>>>
>>>> I've removed the metadata, the vocab lists and the illustrations:
>>>>
>>>> https://gist.github.com/jtauber/6347309
>>>>
>>>> James
>>>>
>>>>
>>>> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
>>>>
>>>>> I am sold on the kid’s story idea. I looked at the link below, and
>>>>> there is a lot of metadata in this file. It would have to be removed
>>>>> before feeding it to the CLA.
>>>>>
>>>>> My assumption is that we would need a CLA with more columns than the
>>>>> standard 2048. How many bits are in your word fingerprints? Could we
>>>>> make each bit a column and skip the SP?
>>>>>
>>>>> Jeff
>>>>>
>>>>> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>>>> Sent: Monday, August 26, 2013 3:50 AM
>>>>> To: NuPIC general mailing list.
>>>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>>
>>>>> Ian,
>>>>>
>>>>> I also thought about something from the Gutenberg repository.
>>>>> But I think we should start with something from the Kids Shelf.
>>>>>
>>>>> There are several reasons in my opinion:
>>>>>
>>>>> - We start experimentation with a full bag of unknown parameters, so
>>>>> keeping the test material simple would allow us to detect the important
>>>>> ones sooner. And it is quite some work to create a reliable evaluation
>>>>> framework, so the size of the data set makes a difference.
>>>>>
>>>>> - Keeping the text simple and short substantially reduces the overall
>>>>> vocabulary. If we want people to also evaluate offline, matching
>>>>> fingerprints can become a lengthy process without an efficient similarity
>>>>> engine.
>>>>>
>>>>> - Another reason is the fact that we don't know how much information a
>>>>> given set of columns (like the 2048 typically used) can absorb. In other
>>>>> words: what is the optimal ratio between a first layer of a text-HTM and
>>>>> the amount of text?
>>>>>
>>>>> - Lastly, I believe that the sequence in which text is presented to the
>>>>> CLA is of importance. After all, when humans learn information by reading,
>>>>> they also start from simple to complex language. The amount of new
>>>>> vocabulary during training should be relatively stable (the actual amount
>>>>> would probably be linked to the ratio of my previous argument).
>>>>>
>>>>> So we should build continuously more complex training data sets,
>>>>> finally ending up with "true" books like the ones you listed.
>>>>>
>>>>> To start, I would suggest something like:
>>>>>
>>>>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by
>>>>> Children
>>>>> http://www.gutenberg.org/ebooks/7841
>>>>>
>>>>> But there might still be better ones…
>>>>>
>>>>> Francisco
>>>>>
>>>>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>>>>>
>>>>> I will make 3 suggestions. All are out of copyright, well known,
>>>>> uncontroversial, and still taught in schools (at least in the US):
>>>>>
>>>>> 1. Robinson Crusoe - Daniel Defoe
>>>>> http://www.gutenberg.org/ebooks/521
>>>>>
>>>>> 2. Great Expectations - Charles Dickens
>>>>> http://www.gutenberg.org/ebooks/1400
>>>>>
>>>>> 3. The Time Machine - H.G. Wells
>>>>> http://www.gutenberg.org/ebooks/35
>>>>>
>>>>> Ian
>>>>>
>>>>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]>
>>>>> wrote:
>>>>>
>>>>> For those who don't want to use the API, and for evaluation purposes, I
>>>>> would propose that we choose some reference text and I convert it into a
>>>>> sequence of SDRs. This file could be used for training.
>>>>>
>>>>> I would also generate a list of all words contained in the text,
>>>>> together with their SDRs, to be used as a conversion table.
>>>>>
>>>>> As a simple test measure we could feed a sequence of SDRs into a
>>>>> trained network and see if the HTM makes the right prediction about the
>>>>> following word(s).
>>>>>
>>>>> The last file to produce for a complete framework would be a list of,
>>>>> let's say, 100 word sequences with their correct continuation.
>>>>>
>>>>> The word sequences could be, for example, the beginnings of phrases with
>>>>> more than n words (n being the number of steps ahead that the CLA can
>>>>> predict).
>>>>>
>>>>> This could be the beginning of a measuring set-up that allows us to
>>>>> compare different CLA implementation flavors.
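The test measure described above could be wired up roughly like this (everything here is hypothetical: `model`, `predict_next`, and the lookup helpers stand in for whatever interface the trained CLA and the word/SDR conversion table would actually expose):

```python
# Hypothetical evaluation harness for the set-up described above.
# `model.reset()` / `model.predict_next(sdr)` stand in for the trained
# network's interface; `sdr_of` is the word-to-SDR conversion table and
# `words_near` maps a predicted SDR back to candidate words.

def evaluate(model, test_items, sdr_of, words_near, top_k=1):
    """test_items: (prefix_words, correct_next_word) pairs, e.g. phrase
    beginnings with their correct continuation. Returns the fraction of
    items where the correct word is among the top-k predictions."""
    hits = 0
    for prefix, answer in test_items:
        model.reset()                      # start a fresh sequence
        predicted = None
        for word in prefix:
            predicted = model.predict_next(sdr_of[word])
        if answer in words_near(predicted, top_k):
            hits += 1
    return hits / len(test_items)
```

Reporting this single accuracy number per implementation would make different CLA flavors directly comparable.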
>>>>>
>>>>> Any suggestions for a text to choose?
>>>>>
>>>>> Francisco
>>>>>
>>>>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>>>>>
>>>>> Very cool, Francisco. Here is where you can get cept API credentials:
>>>>> https://cept.3scale.net/signup
>>>>>
>>>>> ---------
>>>>> Matt Taylor
>>>>> OS Community Flag-Bearer
>>>>> Numenta
>>>>>
>>>>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Just a short post scriptum:
>>>>>
>>>>> The public version of our API doesn't actually contain the generic
>>>>> conversion function. But if people from the HTM community want to
>>>>> experiment, just click the "Request for Beta-Program" button and I will
>>>>> upgrade your accounts manually.
>>>>>
>>>>> Francisco
>>>>>
>>>>>
>>>>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>>>>>
>>>>> > Jeff,
>>>>> > I thought about this already.
>>>>> > We have a REST API where you can send a word in and get the SDR
>>>>> back, and vice versa.
>>>>> > I invite all who want to experiment to try it out.
>>>>> > You just need to get credentials at our website: www.cept.at.
>>>>> >
>>>>> > In the mid-term, it would be cool to create some sort of evaluation
>>>>> > set that could be used to measure progress while improving the CLA.
>>>>> >
>>>>> > We are continuously improving our Retina, but the version that is
>>>>> > currently online works pretty well already.
>>>>> >
>>>>> > I hope that will help.
>>>>> >
>>>>> > Francisco
>>>>> >
>>>>> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>>>>> >
>>>>> >> Francisco,
>>>>> >> Your work is very cool. Do you think it would be possible to make
>>>>> >> available your word SDRs (or a sufficient subset of them) for
>>>>> >> experimentation? I imagine there would be interest in the NuPIC
>>>>> >> community in training a CLA on text using your word SDRs. You might
>>>>> >> get some useful results more quickly. You could do this under a
>>>>> >> research-only license or something like that.
>>>>> >> Jeff
>>>>> >>
>>>>> >> -----Original Message-----
>>>>> >> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>>>> >> Sent: Wednesday, August 21, 2013 1:01 PM
>>>>> >> To: NuPIC general mailing list.
>>>>> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>>> >>
>>>>> >> Hello,
>>>>> >> I am one of the founders of CEPT Systems and lead researcher of our
>>>>> >> retina algorithm.
>>>>> >>
>>>>> >> We have developed a method to represent words by a bitmap pattern
>>>>> >> capturing most of their "lexical semantics" (a text sensor). Our
>>>>> >> word-SDRs fulfill all the requirements for "good" HTM input data.
>>>>> >>
>>>>> >> - Words with similar meaning "look" similar
>>>>> >> - If you drop random bits in the representation, the semantics
>>>>> >> remain intact
>>>>> >> - Only a small number (up to 5%) of bits are set in a word-SDR
>>>>> >> - Every bit in the representation corresponds to a specific semantic
>>>>> >> feature of the language used
>>>>> >> - The Retina (sensory organ for an HTM) can be trained on any
>>>>> >> language
>>>>> >> - The Retina training process is fully unsupervised.
>>>>> >>
>>>>> >> We have found that the word-SDRs by themselves (without using any
>>>>> >> HTM yet) can improve many NLP problems that are only poorly solved
>>>>> >> using the traditional statistical approaches.
>>>>> >> We use the SDRs to:
>>>>> >> - Create fingerprints of text documents, which allows us to compare
>>>>> >> them for semantic similarity using simple (Euclidean) similarity
>>>>> >> measures
>>>>> >> - Automatically detect polysemy and disambiguate multiple meanings
>>>>> >> - Characterize any text with context terms for automatic
>>>>> >> search-engine query-expansion.
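As a toy illustration of the fingerprint comparison described above (the set-based representation and the similarity functions are my assumptions for the sketch; CEPT's actual Retina format may differ):

```python
# Toy sketch: word/document fingerprints as sets of active bit positions
# in a sparse binary vector. Representation and similarity choices are
# illustrative assumptions, not CEPT's actual format.

def overlap(a: set, b: set) -> int:
    """Number of shared active bits - the simplest SDR similarity."""
    return len(a & b)

def jaccard(a: set, b: set) -> float:
    """Normalized overlap, comparable across fingerprints of different sizes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def text_fingerprint(word_sdrs):
    """Union the word-SDRs of a text into a document fingerprint; related
    words share bits, so coherent texts stay relatively sparse."""
    fp = set()
    for sdr in word_sdrs:
        fp |= sdr
    return fp
```

For binary vectors this overlap count is equivalent to the Euclidean comparison mentioned above, since the squared Euclidean distance between two binary vectors is just the number of non-shared bits.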
>>>>> >>
>>>>> >> We hope to successfully link up our Retina to an HTM network to go
>>>>> >> beyond lexical semantics into the field of "grammatical semantics".
>>>>> >> This would hopefully lead to improved abstracting, conversation,
>>>>> >> question-answering, and translation systems.
>>>>> >>
>>>>> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>>>>> >>
>>>>> >> I am interested in any form of cooperation to apply HTM technology
>>>>> >> to text.
>>>>> >>
>>>>> >> Francisco
>>>>> >>
>>>>> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>>>>> >>
>>>>> >>>
>>>>> >>> Hello.
>>>>> >>>
>>>>> >>> Like many of you here, I am pretty new to HTM technology.
>>>>> >>>
>>>>> >>> I am a researcher in Brazil and I am going to start my PhD program
>>>>> >>> soon. My field of interest is NLP and the extraction of knowledge
>>>>> >>> from text. I am thinking of using the ideas behind the Memory
>>>>> >>> Prediction Framework to investigate semantic information retrieval
>>>>> >>> from the Web, and answering questions in natural language. I intend
>>>>> >>> to use the HTM implementation as a base to do this.
>>>>> >>>
>>>>> >>> I would appreciate it a lot if someone could answer some questions:
>>>>> >>>
>>>>> >>> - Is there any research related to HTM and NLP? Could you point me
>>>>> >>> to it?
>>>>> >>>
>>>>> >>> - Is HTM suited to address this problem? Could it learn, without
>>>>> >>> supervision, the grammar of a language, or just help in some aspects
>>>>> >>> such as Named Entity Recognition?
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> Regards,
>>>>> >>>
>>>>> >>> Christian
>>>>> >>>
>>>>> >>>
>>>>> >>> _______________________________________________
>>>>> >>> nupic mailing list
>>>>> >>> [email protected]
>>>>> >>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> James Tauber
>>>> http://jtauber.com/
>>>> @jtauber on Twitter
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
>



-- 
James Tauber
http://jtauber.com/
@jtauber on Twitter
