James,
that's great!
I think a few more preparation steps are necessary:
- All CRLFs should be removed, keeping one blank after each full stop. (This
makes it easier for most parsers.)
- Each line of asterisks should be replaced by a CRLF to mark the paragraphs.
(We may need paragraph information at some point.)
- The file should be split into single tales. (Whatever experiments we run,
rerunning them with different tales makes the results more comparable.)
- The titles should not be written in caps. (A capital letter followed by a
full stop is interpreted as an acronym or middle initial rather than a
sentence delimiter.)
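
A minimal sketch of these four steps in Python, under assumptions that are not confirmed by the file itself: a title is taken to be any line whose letters are all upper case, and a paragraph break is any line consisting only of asterisks. The function names are made up for illustration.

```python
import re

# Assumption: a paragraph separator looks like "* * * * *" or "*****".
ASTERISK_LINE = re.compile(r"^\s*\*(?:\s*\*)+\s*$")


def is_title(line):
    """Assumption: a title is a line containing letters, all in upper case."""
    stripped = line.strip()
    has_letters = any(c.isalpha() for c in stripped)
    return has_letters and stripped == stripped.upper()


def split_tales(text):
    """Split the raw text into (title, body_lines) pairs at title lines.

    Titles are converted from ALL CAPS to sentence case.
    """
    tales = []
    for line in text.splitlines():  # splitlines() handles both CRLF and LF
        if is_title(line):
            tales.append((line.strip().capitalize(), []))
        elif tales:
            tales[-1][1].append(line)
    return tales


def normalize_body(lines, paragraph_sep="\r\n"):
    """Drop hard line breaks and turn asterisk lines into paragraph breaks.

    Runs of whitespace collapse to a single blank, so exactly one blank
    follows each full stop.
    """
    paragraphs, current = [], []
    for line in lines:
        if ASTERISK_LINE.match(line):
            if current:
                paragraphs.append(" ".join(current))
                current = []
        elif line.strip():
            current.append(" ".join(line.split()))
    if current:
        paragraphs.append(" ".join(current))
    return paragraph_sep.join(paragraphs)
```

Usage would be something like `[(title, normalize_body(body)) for title, body in split_tales(raw_text)]`. The all-caps title rule would also catch a short shouted line inside a tale, so the output should be checked by eye.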

Francisco


On 27.08.2013, at 00:22, James Tauber <[email protected]> wrote:

> I've removed the metadata, the vocab lists and the illustrations:
> 
> https://gist.github.com/jtauber/6347309
> 
> James
> 
> 
> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
> I am sold on the kids' story idea. I looked at the link below and there is a
> lot of metadata in this file. It would have to be removed before feeding the
> text to the CLA.
> 
>  
> 
> My assumption is that we would need a CLA with more columns than the standard 
> 2048.  How many bits are in your word fingerprints?  Could we make each bit a 
> column and skip the SP?
> 
> Jeff
> 
>  
> 
> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
> Sent: Monday, August 26, 2013 3:50 AM
> To: NuPIC general mailing list.
> Subject: Re: [nupic-dev] HTM in Natural Language Processing
> 
>  
> 
> Ian,
> 
> I also thought about something from the Gutenberg repository.
> 
> But I think we should start with something from the Kids Shelf.
> 
>  
> 
> There are several reasons in my opinion:
> 
>  
> 
> - We start experimentation with a full bag of unknown parameters, so keeping
> the test material simple would allow us to detect the important ones sooner.
> It is also quite some work to create a reliable evaluation framework, so the
> size of the data set makes a difference.
> 
> - Keeping the text simple and short substantially reduces the overall
> vocabulary. If we want people to also evaluate offline, matching fingerprints
> can become a lengthy process without an efficient similarity engine.
> 
> - Another reason is that we don't know how much information a given set of
> columns (like the 2048 typically used) can absorb. In other words: what is
> the optimal ratio between the size of the first layer of a text-HTM and the
> amount of text?
> 
> - Lastly, I believe that the sequence in which text is presented to the CLA
> matters. After all, when humans learn by reading, they also progress from
> simple to complex language. The amount of new vocabulary during training
> should be relatively stable (the actual amount would probably be linked to
> the ratio from my previous argument).
> 
>  
> 
> So we should build progressively more complex training data sets, finally
> ending up with "true" books like the ones you listed.
> 
>  
> 
> To start I would suggest something like:
> 
>  
> 
> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
> 
> http://www.gutenberg.org/ebooks/7841
> 
>  
> 
> But there might still be better ones…
> 
>  
> 
> Francisco
> 
>  
> 
>  
> 
>  
> 
> On 25.08.2013, at 23:05, Ian Danforth wrote:
> 
> 
> 
> 
> I will make 3 suggestions. All are out of copyright, well known,
> uncontroversial, and still taught in schools (at least in the US).
> 
>  
> 
> 1. Robinson Crusoe - Daniel Defoe
> 
>  
> 
> http://www.gutenberg.org/ebooks/521
> 
>  
> 
> 2. Great Expectations - Charles Dickens
> 
>  
> 
> http://www.gutenberg.org/ebooks/1400
> 
>  
> 
> 3. The Time Machine - H.G. Wells
> 
>  
> 
> http://www.gutenberg.org/ebooks/35
> 
>  
> 
> Ian
> 
>  
> 
> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
> 
> For those who don't want to use the API, and for evaluation purposes, I
> propose that we choose a reference text and I convert it into a sequence
> of SDRs. This file could be used for training.
> 
> I would also generate a list of all words contained in the text, together
> with their SDRs, to be used as a conversion table.
> 
> As a simple test measure, we could feed a sequence of SDRs into a trained
> network and see whether the HTM makes the right prediction about the
> following word(s).
> 
> The last file to produce for a complete framework would be a list of, let's
> say, 100 word sequences with their correct continuations.
> 
> The word sequences could be, for example, the beginnings of phrases with more
> than n words (n being the number of steps ahead that the CLA can predict).
> 
> This could be the beginning of a measuring set-up that allows us to compare
> different CLA implementation flavors.
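
The measuring set-up described above could be sketched roughly as follows. This is only an illustration under assumptions: an SDR is represented as a frozenset of active bit indices, the conversion table is a plain dict, and `predict` is a hypothetical stand-in for whatever a trained CLA exposes (it is not a NuPIC call). The predicted SDR is mapped back to a word by picking the table entry with the largest bit overlap.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

SDR = FrozenSet[int]  # assumption: an SDR is the set of its active bit indices


def closest_word(sdr: SDR, table: Dict[str, SDR]) -> str:
    """Return the word whose SDR shares the most active bits with `sdr`."""
    return max(table, key=lambda word: len(table[word] & sdr))


def next_word_accuracy(
    predict: Callable[[Sequence[SDR]], SDR],  # hypothetical CLA wrapper
    table: Dict[str, SDR],                    # conversion table: word -> SDR
    test_set: List[Tuple[List[str], str]],    # (word sequence, correct continuation)
) -> float:
    """Fraction of test sequences whose predicted next word is correct."""
    hits = 0
    for words, expected in test_set:
        predicted_sdr = predict([table[w] for w in words])
        if closest_word(predicted_sdr, table) == expected:
            hits += 1
    return hits / len(test_set)
```

Because only the `predict` callable changes between runs, the same table and test set can be reused to compare different CLA implementations.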
> 
>  
> 
> Any suggestions for a text to choose?
> 
>  
> 
> Francisco
> 
>  
> 
> On 24.08.2013, at 17:12, Matthew Taylor wrote:
> 
>  
> 
> Very cool, Francisco. Here is where you can get cept API credentials: 
> https://cept.3scale.net/signup
> 
> 
> 
> ---------
> 
> Matt Taylor
> 
> OS Community Flag-Bearer
> 
> Numenta
> 
>  
> 
> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
> 
> Just a short postscript:
> 
> The public version of our API doesn't actually contain the generic conversion
> function. But if people from the HTM community want to experiment, just click
> the "Request for Beta-Program" button and I will upgrade your account
> manually.
> 
> Francisco
> 
> 
> On 24.08.2013, at 01:59, Francisco Webber wrote:
> 
> > Jeff,
> > I thought about this already.
> > We have a REST API where you can send a word in and get the SDR back, and 
> > vice versa.
> > I invite all who want to experiment to try it out.
> > You just need to get credentials at our website: www.cept.at.
> >
> > In the medium term it would be cool to create some sort of evaluation set
> > that could be used to measure progress while improving the CLA.
> >
> > We are continuously improving our Retina but the version that is currently 
> > online works pretty well already.
> >
> > I hope that will help
> >
> > Francisco
> >
> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
> >
> >> Francisco,
> >> Your work is very cool. Do you think it would be possible to make your
> >> word SDRs (or a sufficient subset of them) available for experimentation?
> >> I imagine there would be interest in the NuPIC community in training a CLA
> >> on text using your word SDRs. You might get some useful results more
> >> quickly. You could do this under a research-only license or something like
> >> that.
> >> Jeff
> >>
> >> -----Original Message-----
> >> From: nupic [mailto:[email protected]] On Behalf Of Francisco
> >> Webber
> >> Sent: Wednesday, August 21, 2013 1:01 PM
> >> To: NuPIC general mailing list.
> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
> >>
> >> Hello,
> >> I am one of the founders of CEPT Systems and lead researcher of our retina
> >> algorithm.
> >>
> >> We have developed a method to represent each word by a bitmap pattern
> >> capturing most of its "lexical semantics" (a text sensor). Our word-SDRs
> >> fulfill all the requirements for "good" HTM input data:
> >>
> >> - Words with similar meaning "look" similar
> >> - If you drop random bits in the representation the semantics remain intact
> >> - Only a small number (up to 5%) of bits are set in a word-SDR
> >> - Every bit in the representation corresponds to a specific semantic
> >> feature of the language used
> >> - The Retina (sensory organ for an HTM) can be trained on any language
> >> - The Retina training process is fully unsupervised.
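
The first three properties in the list above can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: SDRs are modeled as sets of active bit indices, the bit patterns are hand-made toy data rather than real Retina output, and the total size of 16384 bits is a placeholder, not a figure from this thread.

```python
import random

SDR_SIZE = 16384  # hypothetical total number of bits, for illustration only


def overlap(a, b):
    """Number of active bits two SDRs share (higher means more similar)."""
    return len(a & b)


def drop_bits(sdr, fraction, rng):
    """Return a copy of `sdr` with about `fraction` of its active bits removed."""
    keep = round(len(sdr) * (1 - fraction))
    return frozenset(rng.sample(sorted(sdr), keep))


def sparsity(sdr, size=SDR_SIZE):
    """Fraction of all bits that are active."""
    return len(sdr) / size


# Toy data: "dog" and "wolf" share 300 bits, "tax" shares none with either.
dog = frozenset(range(0, 400))
wolf = frozenset(range(100, 500))
tax = frozenset(range(5000, 5400))

noisy_dog = drop_bits(dog, 0.3, random.Random(0))
```

With this toy data, `noisy_dog` still overlaps `wolf` far more than `tax`, which is the sense in which dropping random bits leaves the similarity structure intact.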
> >>
> >> We have found that the word-SDR by itself (without using any HTM yet)
> >> can improve results on many NLP problems that are only poorly solved by
> >> traditional statistical approaches.
> >> We use the SDRs to:
> >> - Create fingerprints of text documents, which allows us to compare them
> >> for semantic similarity using simple (Euclidean) similarity measures.
> >> - Automatically detect polysemy and disambiguate multiple meanings.
> >> - Characterize any text with context terms for automatic search-engine
> >> query expansion.
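
For binary fingerprints, the Euclidean measure mentioned above reduces to counting differing bits. A minimal sketch, assuming each fingerprint is given as a set of active bit indices (how document fingerprints are built from word-SDRs is not described in this thread and is not shown here):

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two binary vectors given as sets of active bits.

    Each bit that is set in exactly one of the two vectors contributes
    (1 - 0)^2 = 1 to the sum of squares, so the distance is the square
    root of the size of the symmetric difference.
    """
    return math.sqrt(len(a ^ b))
```

A smaller distance means more similar fingerprints; identical fingerprints have distance 0.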
> >>
> >> We hope to successfully link up our Retina to an HTM network to go beyond
> >> lexical semantics into the field of "grammatical semantics".
> >> This would hopefully lead to improved abstracting, conversation,
> >> question-answering and translation systems.
> >>
> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
> >>
> >> I am interested in any form of cooperation to apply HTM technology to text.
> >>
> >> Francisco
> >>
> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
> >>
> >>>
> >>> Hello.
> >>>
> >>> Like many of you here, I am pretty new to HTM technology.
> >>>
> >>> I am a researcher in Brazil and I am going to start my PhD program soon.
> >>> My field of interest is NLP and the extraction of knowledge from text. I am
> >>> thinking of using the ideas behind the Memory Prediction Framework to
> >>> investigate semantic information retrieval from the Web, and to answer
> >>> questions in natural language. I intend to use the HTM implementation as
> >>> the basis for this.
> >>>
> >>> I would appreciate it a lot if someone could answer some questions:
> >>>
> >>> - Is there any research relating HTM and NLP? Could you point me to it?
> >>>
> >>> - Is HTM suitable for addressing this problem? Could it learn, without
> >>> supervision, the grammar of a language, or would it just help with some
> >>> aspects such as Named Entity Recognition?
> >>>
> >>>
> >>>
> >>> Regards,
> >>>
> >>> Christian
> >>>
> >>>
> >>> _______________________________________________
> >>> nupic mailing list
> >>> [email protected]
> >>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
> 
> -- 
> James Tauber
> http://jtauber.com/
> @jtauber on Twitter