I've removed the metadata, the vocab lists and the illustrations:

https://gist.github.com/jtauber/6347309
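
For anyone who wants to repeat this kind of cleanup on another Gutenberg text, here is a minimal sketch. It is not necessarily how the gist above was produced. It assumes the file uses the usual "*** START OF ..." / "*** END OF ..." marker lines and "[Illustration ...]" placeholders. The function name strip_gutenberg is made up for illustration, and vocabulary lists would still have to be removed by hand.

```python
import re

# Assumption: the usual Project Gutenberg marker lines delimit the book text.
START_RE = re.compile(
    r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END_RE = re.compile(
    r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
# Assumption: illustrations appear as "[Illustration]" or "[Illustration: caption]".
ILLUSTRATION_RE = re.compile(r"\[Illustration[^\]]*\]")


def strip_gutenberg(text: str) -> str:
    """Return the book text without the header, the license footer and
    illustration placeholders (hypothetical helper, not the gist's code)."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0        # no marker: keep from the top
    finish = end.start() if end else len(text)  # no marker: keep to the end
    body = ILLUSTRATION_RE.sub("", text[begin:finish])
    # Collapse the blank lines left behind by removed placeholders.
    body = re.sub(r"\n{3,}", "\n\n", body)
    return body.strip()
```

If a file has no marker lines, the function leaves the text as it is apart from placeholder removal and whitespace trimming.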

James


On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:

> I am sold on the kid’s story idea.  I looked at the link below and there
> is a lot of metadata in this file.  It would have to be removed before
> feeding it to the CLA.
>
> My assumption is that we would need a CLA with more columns than the
> standard 2048.  How many bits are in your word fingerprints?  Could we make
> each bit a column and skip the SP?
>
> Jeff
>
> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
> Sent: Monday, August 26, 2013 3:50 AM
> To: NuPIC general mailing list.
> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>
> Ian,
>
> I also thought about something from the Gutenberg repository.
> But I think we should start with something from the Kids Shelf.
>
> There are several reasons, in my opinion:
>
> - We start experimenting with a full bag of unknown parameters, so
> keeping the test material simple would allow us to detect the important
> ones sooner. It is also quite some work to create a reliable evaluation
> framework, so the size of the data set makes a difference.
>
> - Keeping the text simple and short substantially reduces the overall
> vocabulary. If we want people to also evaluate offline, matching
> fingerprints can become a lengthy process without an efficient similarity
> engine.
>
> - Another reason is that we don't know how much information a given set of
> columns (like the 2048 typically used) can absorb. In other words: what is
> the optimal ratio between the size of the first layer of a text-HTM and
> the amount of text?
>
> - Lastly, I believe that the sequence in which text is presented to the CLA
> is important. After all, when humans learn by reading, they also progress
> from simple to complex language. The amount of new vocabulary
> encountered during training should be relatively stable (the actual amount
> would probably be linked to the ratio in my previous point).
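
One way to check the last point on a candidate text is to count how many previously unseen words each successive chunk introduces. The sketch below is only an illustration: the tokenizer is deliberately crude, and the function names and the chunk size are assumptions, not part of any agreed framework.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into rough word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def new_vocabulary_per_chunk(tokens: List[str], chunk_size: int) -> List[int]:
    """For each consecutive chunk of `chunk_size` tokens, count the words
    that did not occur in any earlier chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    seen = set()
    counts = []
    for start in range(0, len(tokens), chunk_size):
        fresh = set(tokens[start:start + chunk_size]) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts
```

A roughly flat series of counts would suggest a steady vocabulary load. A series with sharp spikes would point to passages that may need to be reordered or held back for a later training stage.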
>
> So we should build progressively more complex training data sets, finally
> ending up with "true" books like the ones you listed.
>
> To start, I would suggest something like:
>
> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by
> Children
> http://www.gutenberg.org/ebooks/7841
>
> But there might still be better ones…
>
> Francisco
>
> On 25.08.2013, at 23:05, Ian Danforth wrote:
>
> I will make 3 suggestions. All are out of copyright, well known,
> uncontroversial, and still taught in schools (at least in the US).
>
> 1. Robinson Crusoe - Daniel Defoe
> http://www.gutenberg.org/ebooks/521
>
> 2. Great Expectations - Charles Dickens
> http://www.gutenberg.org/ebooks/1400
>
> 3. The Time Machine - H.G. Wells
> http://www.gutenberg.org/ebooks/35
>
> Ian
>
> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]>
> wrote:
>
> For those who don't want to use the API, and for evaluation purposes, I
> propose that we choose a reference text and I convert it into a
> sequence of SDRs. This file could be used for training.
>
> I would also generate a list of all words contained in the text, together
> with their SDRs, to be used as a conversion table.
>
> As a simple test measure, we could feed a sequence of SDRs into a trained
> network and see if the HTM makes the right prediction about the following
> word(s).
>
> The last file to produce for a complete framework would be a list of, let's
> say, 100 word sequences with their correct continuations.
>
> The word sequences could be, for example, the beginnings of phrases with
> more than n words (n being the number of steps ahead that the CLA can
> predict).
>
> This could be the beginning of a measuring set-up that allows us to compare
> different CLA implementation flavors.
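
The pieces proposed above (conversion table, prompt/continuation list, prediction check) could fit together roughly as sketched below. Everything here is a hypothetical illustration. SDRs are modelled as sets of active bit indices, `predict` stands in for whatever trained network is under test (it is not a NuPIC call), and decoding by largest overlap is just one possible choice.

```python
from typing import Callable, Dict, FrozenSet, List, Sequence, Tuple

SDR = FrozenSet[int]                 # an SDR as the set of its active bit indices
EvalItem = Tuple[List[str], str]     # (prompt words, expected next word)


def build_eval_items(sentences: Sequence[Sequence[str]], n: int) -> List[EvalItem]:
    """Use the first n words of every sentence longer than n words as the
    prompt, and the word that follows as the expected continuation."""
    return [(list(s[:n]), s[n]) for s in sentences if len(s) > n]


def decode(sdr: SDR, table: Dict[str, SDR]) -> str:
    """Map an SDR back to the table word whose SDR overlaps it the most."""
    return max(table, key=lambda word: len(table[word] & sdr))


def accuracy(items: Sequence[EvalItem], table: Dict[str, SDR],
             predict: Callable[[List[SDR]], SDR]) -> float:
    """Fraction of prompts for which the decoded prediction is the expected
    word. `predict` is a stand-in for the trained network under test."""
    if not items:
        return 0.0
    hits = 0
    for prompt, expected in items:
        predicted = predict([table[word] for word in prompt])
        hits += decode(predicted, table) == expected
    return hits / len(items)
```

Because the score depends only on the table, the item list and a `predict` callable, two CLA implementations could be compared by swapping the callable while keeping the files fixed.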
>
> Any suggestions for a text to choose?
>
> Francisco
>
> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>
> Very cool, Francisco. Here is where you can get cept API credentials:
> https://cept.3scale.net/signup
>
> ---------
> Matt Taylor
> OS Community Flag-Bearer
> Numenta
>
> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]>
> wrote:
>
> Just a short post scriptum:
>
> The public version of our API doesn't actually contain the generic
> conversion function. But if people from the HTM community want to
> experiment, just click the "Request for Beta-Program" button and I will
> upgrade your accounts manually.
>
> Francisco
>
>
> On 24.08.2013, at 01:59, Francisco Webber wrote:
>
> > Jeff,
> > I thought about this already.
> > We have a REST API where you can send a word in and get the SDR back,
> > and vice versa.
> > I invite all who want to experiment to try it out.
> > You just need to get credentials at our website: www.cept.at.
> >
> > In the mid-term it would be cool to create some sort of evaluation set
> > that could be used to measure progress while improving the CLA.
> >
> > We are continuously improving our Retina, but the version that is
> > currently online already works pretty well.
> >
> > I hope that will help.
> >
> > Francisco
> >
> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
> >
> >> Francisco,
> >> Your work is very cool.  Do you think it would be possible to make
> >> your word SDRs (or a sufficient subset of them) available for
> >> experimentation?  I imagine there would be interest in the NuPIC
> >> community in training a CLA on text using your word SDRs.  You might
> >> get some useful results more quickly.  You could do this under a
> >> research-only license or something like that.
> >> Jeff
> >>
> >> -----Original Message-----
> >> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
> >> Sent: Wednesday, August 21, 2013 1:01 PM
> >> To: NuPIC general mailing list.
> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
> >>
> >> Hello,
> >> I am one of the founders of CEPT Systems and lead researcher of our
> >> retina algorithm.
> >>
> >> We have developed a method to represent words by a bitmap pattern
> >> capturing most of their "lexical semantics" (a text sensor). Our
> >> word-SDRs fulfill all the requirements for "good" HTM input data:
> >>
> >> - Words with similar meaning "look" similar
> >> - If you drop random bits in the representation the semantics remain
> >> intact
> >> - Only a small number (up to 5%) of bits are set in a word-SDR
> >> - Every bit in the representation corresponds to a specific semantic
> >> feature of the language used
> >> - The Retina (sensory organ for an HTM) can be trained on any language
> >> - The Retina training process is fully unsupervised.
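
The first three properties in the list can be illustrated with plain set arithmetic. This is a toy sketch, not the CEPT representation: the SDRs are made-up sets of active bit indices, the total of 16384 bits is an assumed size chosen only for the example, and overlap counting is one simple stand-in for "looking similar".

```python
import random
from typing import FrozenSet

SDR = FrozenSet[int]   # an SDR as the set of its active bit indices

# Assumption for illustration only; the real fingerprint size is not stated here.
TOTAL_BITS = 16384


def overlap(a: SDR, b: SDR) -> int:
    """Number of active bits two SDRs share; more shared bits = more similar."""
    return len(a & b)


def sparsity(sdr: SDR, total_bits: int = TOTAL_BITS) -> float:
    """Fraction of all bits that are active in the SDR."""
    return len(sdr) / total_bits


def drop_bits(sdr: SDR, fraction: float, rng: random.Random) -> SDR:
    """Return a copy of the SDR with roughly `fraction` of its active bits
    removed at random."""
    keep = round(len(sdr) * (1.0 - fraction))
    return frozenset(rng.sample(sorted(sdr), keep))


# Made-up toy SDRs: "cat" and "dog" share half their bits, "car" shares none.
cat = frozenset(range(0, 200))
dog = frozenset(range(100, 300))
car = frozenset(range(1000, 1200))
noisy_cat = drop_bits(cat, 0.3, random.Random(0))
```

With these toy sets, `noisy_cat` still overlaps `cat` more than `dog`, and `dog` more than `car`, so the similarity ranking survives the loss of 30% of the bits. Each SDR sets 200 of 16384 bits, which is well under the 5% bound mentioned above.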
> >>
> >> We have found that the word-SDR by itself (without using any HTM yet)
> >> can improve results on many NLP problems that are only poorly solved
> >> using the traditional statistical approaches.
> >> We use the SDRs to:
> >> - Create fingerprints of text documents, which allows us to compare
> >> them for semantic similarity using simple (Euclidean) similarity
> >> measures
> >> - Automatically detect polysemy and disambiguate multiple meanings
> >> - Characterize any text with context terms for automatic
> >> search-engine query expansion
> >>
> >> We hope to successfully link up our Retina to an HTM network to go
> >> beyond lexical semantics into the field of "grammatical semantics".
> >> This would hopefully lead to improved abstracting, conversation,
> >> question answering, and translation systems.
> >>
> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
> >>
> >> I am interested in any form of cooperation to apply HTM technology to
> >> text.
> >>
> >> Francisco
> >>
> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
> >>
> >>>
> >>> Hello.
> >>>
> >>> Like many of you here, I am pretty new to HTM technology.
> >>>
> >>> I am a researcher in Brazil and I am going to start my PhD program
> >>> soon. My field of interest is NLP and the extraction of knowledge from
> >>> text. I am thinking of using the ideas behind the Memory Prediction
> >>> Framework to investigate semantic information retrieval from the Web
> >>> and to answer questions in natural language. I intend to use the HTM
> >>> implementation as the basis for this.
> >>>
> >>> I would appreciate it a lot if someone could answer some questions:
> >>>
> >>> - Is there any research relating HTM and NLP? Could you point me to it?
> >>>
> >>> - Is HTM suitable for addressing this problem? Could it learn, without
> >>> supervision, the grammar of a language, or just help in some aspects
> >>> such as Named Entity Recognition?
> >>>
> >>>
> >>>
> >>> Regards,
> >>>
> >>> Christian
> >>>
> >>>
> >>> _______________________________________________
> >>> nupic mailing list
> >>> [email protected]
> >>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org


-- 
James Tauber
http://jtauber.com/
@jtauber on Twitter
