yep, I'm working on it :-)
On Tue, Aug 27, 2013 at 12:29 PM, Francisco Webber <[email protected]> wrote:

> Yes, James, that looks perfect. Great job!
> Now we need the other tales in the same format.
>
> Francisco
>
> On 27.08.2013, at 15:14, James Tauber wrote:
>
> Let me know if this is what you had in mind (just the ugly duckling):
>
> https://gist.github.com/jtauber/6347309#file-the_ugly_duckling-txt
>
> I put each paragraph on its own line and separated the sections (which
> were formerly separated by a row of asterisks) with a blank line.
>
> James
>
> On Tue, Aug 27, 2013 at 7:59 AM, Francisco De Sousa Webber <[email protected]> wrote:
>
>> James,
>> that's great! I think some more preparation is necessary:
>>
>> - All CRLFs should be removed, keeping one space after each full stop.
>>   (This makes it easier for most parsers.)
>> - The line of asterisks should be replaced by a CRLF to mark the
>>   paragraphs. (We never know; we might need the paragraph info at some
>>   point.)
>> - The file as such should be split into single tales. (Whatever
>>   experiments we run, if we rerun them with different tales, the results
>>   become more comparable.)
>> - The title should not be written in caps. (A capital letter followed by
>>   a full stop is interpreted as an acronym or a middle initial instead
>>   of a sentence delimiter.)
>>
>> Francisco
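A minimal sketch of the preparation steps listed above (Python; the file
names and the title-case heuristic are illustrative assumptions, not
something specified in the thread):

import re

# Placeholder input: the raw Project Gutenberg text of one tale.
with open("primary_reader.txt") as f:
    raw = f.read().replace("\r\n", "\n")   # drop CRLFs up front

# Mark each row of asterisks as a section break before unwrapping lines.
raw = re.sub(r"(?m)^[ \t]*\*[ \t*]*$", "<SECTION>", raw)

out = []
for block in raw.split("\n\n"):            # blank line = paragraph break
    block = " ".join(block.split())        # unwrap hard line breaks, one
    if not block:                          # space left after each full stop
        continue
    if block == "<SECTION>":
        out.append("")                     # blank line between sections
    elif block.isupper():                  # all-caps title -> title case
        out.append(block.title())
    else:
        out.append(block)

# One paragraph per line, sections separated by a blank line.
with open("primary_reader_clean.txt", "w") as f:
    f.write("\n".join(out) + "\n")

Splitting the cleaned file into single tales is left out here; where one
tale ends in that collection is easiest to decide by hand.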
>> On 27.08.2013 at 00:22, James Tauber <[email protected]> wrote:
>>
>> I've removed the metadata, the vocab lists and the illustrations:
>>
>> https://gist.github.com/jtauber/6347309
>>
>> James
>>
>> On Mon, Aug 26, 2013 at 2:10 PM, Jeff Hawkins <[email protected]> wrote:
>>
>>> I am sold on the kids' story idea. I looked at the link below, and
>>> there is a lot of metadata in this file. It would have to be removed
>>> before feeding it to the CLA.
>>>
>>> My assumption is that we would need a CLA with more columns than the
>>> standard 2048. How many bits are in your word fingerprints? Could we
>>> make each bit a column and skip the SP?
>>>
>>> Jeff
>>>
>>> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>> Sent: Monday, August 26, 2013 3:50 AM
>>> To: NuPIC general mailing list.
>>> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>>
>>> Ian,
>>> I also thought about something from the Gutenberg repository, but I
>>> think we should start with something from the kids' shelf.
>>>
>>> There are several reasons, in my opinion:
>>>
>>> - We start experimentation with a full bag of unknown parameters, so
>>>   keeping the test material simple would let us detect the important
>>>   ones sooner. It is also quite some work to create a reliable
>>>   evaluation framework, so the size of the data set makes a difference.
>>> - Keeping the text simple and short substantially reduces the overall
>>>   vocabulary. If we want people to also evaluate offline, matching
>>>   fingerprints can become a lengthy process without an efficient
>>>   similarity engine.
>>> - Another reason is that we don't know how much information a given set
>>>   of columns (like the 2048 typically used) can absorb. In other words:
>>>   what is the optimal ratio between the first layer of a text-HTM and
>>>   the amount of text?
>>> - Lastly, I believe that the sequence in which text is presented to the
>>>   CLA matters. After all, when humans learn information by reading,
>>>   they also progress from simple to complex language. The amount of new
>>>   vocabulary during training should be relatively stable (the actual
>>>   amount would probably be linked to the ratio in my previous argument).
>>>
>>> So we should build progressively more complex training data sets,
>>> finally ending up with "true" books like the ones you listed.
>>>
>>> To start, I would suggest something like:
>>>
>>> A Primary Reader: Old-time Stories, Fairy Tales and Myths Retold by Children
>>> http://www.gutenberg.org/ebooks/7841
>>>
>>> But there might still be better ones…
>>>
>>> Francisco
>>>
>>> On 25.08.2013, at 23:05, Ian Danforth wrote:
>>>
>>> I will make three suggestions. All are out of copyright, well known,
>>> uncontroversial, and still taught in schools (at least in the US):
>>>
>>> 1. Robinson Crusoe - Daniel Defoe
>>>    http://www.gutenberg.org/ebooks/521
>>>
>>> 2. Great Expectations - Charles Dickens
>>>    http://www.gutenberg.org/ebooks/1400
>>>
>>> 3. The Time Machine - H.G. Wells
>>>    http://www.gutenberg.org/ebooks/35
>>>
>>> Ian
>>>
>>> On Sat, Aug 24, 2013 at 10:24 AM, Francisco Webber <[email protected]> wrote:
>>>
>>> For those who don't want to use the API, and for evaluation purposes,
>>> I would propose that we choose some reference text and I convert it
>>> into a sequence of SDRs. This file could be used for training.
>>>
>>> I would also generate a list of all words contained in the text,
>>> together with their SDRs, to be used as a conversion table.
>>>
>>> As a simple test measure, we could feed a sequence of SDRs into a
>>> trained network and see if the HTM makes the right prediction about
>>> the following word(s). The last file to produce for a complete
>>> framework would be a list of, let's say, 100 word sequences with their
>>> correct continuations. The word sequences could be, for example, the
>>> beginnings of phrases with more than n words (n being the number of
>>> steps ahead that the CLA can predict).
>>>
>>> This could be the beginning of a measuring setup that allows us to
>>> compare different CLA implementation flavors.
>>>
>>> Any suggestions for a text to choose?
>>>
>>> Francisco
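A sketch of that scoring setup (Python; `predict_sdr` stands in for
whatever interface a trained CLA implementation exposes, and `word_sdrs`
is the word-to-SDR conversion table with each SDR as a set of active bit
indices — all names here are illustrative assumptions):

def best_match(predicted_bits, word_sdrs):
    """Return the word whose SDR overlaps most with the predicted bits."""
    return max(word_sdrs, key=lambda w: len(word_sdrs[w] & predicted_bits))

def score(test_set, word_sdrs, predict_sdr):
    """test_set: list of (prefix_words, correct_next_word) pairs, e.g.
    the ~100 phrase beginnings with their correct continuations."""
    hits = 0
    for prefix, expected in test_set:
        # Feed the prefix SDRs to the implementation under test and take
        # its prediction for the next input.
        predicted_bits = predict_sdr([word_sdrs[w] for w in prefix])
        if best_match(predicted_bits, word_sdrs) == expected:
            hits += 1
    return hits / float(len(test_set))

Because each CLA flavor only needs to supply `predict_sdr`, the same test
file would make the results comparable across implementations.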
>>> On 24.08.2013, at 17:12, Matthew Taylor wrote:
>>>
>>> Very cool, Francisco. Here is where you can get CEPT API credentials:
>>> https://cept.3scale.net/signup
>>>
>>> ---------
>>> Matt Taylor
>>> OS Community Flag-Bearer
>>> Numenta
>>>
>>> On Fri, Aug 23, 2013 at 5:07 PM, Francisco Webber <[email protected]> wrote:
>>>
>>> Just a short post scriptum:
>>>
>>> The public version of our API doesn't actually contain the generic
>>> conversion function. But if people from the HTM community want to
>>> experiment, just click the "Request for Beta-Program" button and I
>>> will upgrade your accounts manually.
>>>
>>> Francisco
>>>
>>> On 24.08.2013, at 01:59, Francisco Webber wrote:
>>>
>>> > Jeff,
>>> > I thought about this already.
>>> > We have a REST API where you can send a word in and get the SDR
>>> > back, and vice versa. I invite all who want to experiment to try it
>>> > out. You just need to get credentials at our website: www.cept.at.
>>> >
>>> > In the mid-term it would be cool to create some sort of evaluation
>>> > set that could be used to measure progress while improving the CLA.
>>> >
>>> > We are continuously improving our Retina, but the version that is
>>> > currently online already works pretty well.
>>> >
>>> > I hope that helps.
>>> >
>>> > Francisco
>>> >
>>> > On 24.08.2013, at 01:46, Jeff Hawkins wrote:
>>> >
>>> >> Francisco,
>>> >> Your work is very cool. Do you think it would be possible to make
>>> >> your word SDRs (or a sufficient subset of them) available for
>>> >> experimentation? I imagine there would be interest in the NuPIC
>>> >> community in training a CLA on text using your word SDRs. You might
>>> >> get some useful results more quickly. You could do this under a
>>> >> research-only license or something like that.
>>> >> Jeff
>>> >>
>>> >> -----Original Message-----
>>> >> From: nupic [mailto:[email protected]] On Behalf Of Francisco Webber
>>> >> Sent: Wednesday, August 21, 2013 1:01 PM
>>> >> To: NuPIC general mailing list.
>>> >> Subject: Re: [nupic-dev] HTM in Natural Language Processing
>>> >>
>>> >> Hello,
>>> >> I am one of the founders of CEPT Systems and the lead researcher on
>>> >> our retina algorithm.
>>> >>
>>> >> We have developed a method to represent words by a bitmap pattern
>>> >> capturing most of their "lexical semantics" (a text sensor). Our
>>> >> word-SDRs fulfill all the requirements for "good" HTM input data:
>>> >>
>>> >> - Words with similar meaning "look" similar.
>>> >> - If you drop random bits from the representation, the semantics
>>> >>   remain intact.
>>> >> - Only a small number (up to 5%) of bits are set in a word-SDR.
>>> >> - Every bit in the representation corresponds to a specific semantic
>>> >>   feature of the language used.
>>> >> - The Retina (the sensory organ for an HTM) can be trained on any
>>> >>   language.
>>> >> - The Retina training process is fully unsupervised.
>>> >>
>>> >> We have found that the word-SDRs by themselves (without using any
>>> >> HTM yet) can improve many NLP problems that are only poorly solved
>>> >> by the traditional statistical approaches. We use the SDRs to:
>>> >>
>>> >> - Create fingerprints of text documents, which allows us to compare
>>> >>   them for semantic similarity using simple (Euclidean) similarity
>>> >>   measures (a sketch follows below this message).
>>> >> - Automatically detect polysemy and disambiguate multiple meanings.
>>> >> - Characterize any text with context terms for automatic
>>> >>   search-engine query expansion.
>>> >>
>>> >> We hope to successfully link up our Retina to an HTM network to go
>>> >> beyond lexical semantics into the field of "grammatical semantics".
>>> >> This would hopefully lead to improved abstracting, conversation,
>>> >> question-answering and translation systems.
>>> >>
>>> >> Our correct web address is www.cept.at (no kangaroos in Vienna ;-)
>>> >>
>>> >> I am interested in any form of cooperation to apply HTM technology
>>> >> to text.
>>> >>
>>> >> Francisco
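The document-fingerprint comparison mentioned above might look roughly
like this (Python/numpy; the 16,384-bit retina width and the ~2% sparsity
cap are assumptions for illustration, not numbers given in this thread):

import numpy as np

N_BITS = 16384   # illustrative retina width, not a CEPT specification

def document_fingerprint(words, word_sdrs):
    """OR together the word-SDRs (sets of active bit indices), keeping
    only the most frequently hit bits so the fingerprint stays sparse."""
    counts = np.zeros(N_BITS, dtype=int)
    for w in words:
        for bit in word_sdrs.get(w, ()):
            counts[bit] += 1
    top = np.argsort(counts)[-int(0.02 * N_BITS):]   # ~2% sparsity, assumed
    fp = np.zeros(N_BITS, dtype=bool)
    fp[top[counts[top] > 0]] = True                  # ignore bits never hit
    return fp

def similarity(fp_a, fp_b):
    """Negative Euclidean distance on the binary vectors; for fingerprints
    with equal bit counts this ranks document pairs exactly like their
    bit overlap does."""
    return -np.linalg.norm(fp_a.astype(float) - fp_b.astype(float))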
>>> >> On 21.08.2013, at 20:16, Christian Cleber Masdeval Braz wrote:
>>> >>
>>> >>> Hello.
>>> >>>
>>> >>> Like many of you here, I am pretty new to HTM technology.
>>> >>>
>>> >>> I am a researcher in Brazil and I am going to start my PhD program
>>> >>> soon. My field of interest is NLP and the extraction of knowledge
>>> >>> from text. I am thinking of using the ideas behind the Memory
>>> >>> Prediction Framework to investigate semantic information retrieval
>>> >>> from the Web and answering questions in natural language. I intend
>>> >>> to use the HTM implementation as a base for this.
>>> >>>
>>> >>> I would appreciate it a lot if someone could answer some questions:
>>> >>>
>>> >>> - Is there any research related to HTM and NLP? Could you point me
>>> >>>   to it?
>>> >>> - Is HTM suited to this problem? Could it learn, without
>>> >>>   supervision, the grammar of a language, or just help with some
>>> >>>   aspects such as Named Entity Recognition?
>>> >>>
>>> >>> Regards,
>>> >>>
>>> >>> Christian

--
James Tauber
http://jtauber.com/
@jtauber on Twitter
_______________________________________________
nupic mailing list
[email protected]
http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
