I think in order to allow word-SDRs as input, we need to implement this
pass-through encoder:

https://issues.numenta.org/browse/NPC-266

I've upped the priority of this ticket. Hopefully we can get it done in the
next sprint.
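For the curious, a pass-through encoder is conceptually simple: the input is
already an SDR (here, a word SDR from the CEPT API), so "encoding" just means
validating the bit array and handing it to nupic unchanged. A rough Python
sketch with hypothetical class and method names (not the actual NPC-266
implementation):

```python
# Hypothetical sketch of a pass-through encoder: the input is already an
# SDR (a list of 0/1 bits), so "encoding" just validates and copies it.

class PassThroughEncoder(object):

    def __init__(self, n):
        self.n = n  # total number of bits in the SDR

    def getWidth(self):
        return self.n

    def encode(self, inputBits):
        # The caller provides a pre-computed SDR; validate and return a copy.
        if len(inputBits) != self.n:
            raise ValueError("expected %d bits, got %d"
                             % (self.n, len(inputBits)))
        if any(bit not in (0, 1) for bit in inputBits):
            raise ValueError("input must be a binary array")
        return list(inputBits)


# Example: a made-up 16-bit word SDR passes through untouched.
sdr = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
encoder = PassThroughEncoder(16)
assert encoder.encode(sdr) == sdr
```

The real thing would plug into nupic's encoder interface; the sketch only
shows the pass-through idea itself.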

---------
Matt Taylor
OS Community Flag-Bearer
Numenta


On Thu, Sep 12, 2013 at 2:00 PM, Chetan Surpur <[email protected]> wrote:

> I intend to try it out as well, but I didn't see any data dump from the
> CEPT API yet (I know people were talking about it in the other thread). Let
> me know if you make any progress on this! If you're interested, we can even
> work together on it.
>
>
> On Thu, Sep 12, 2013 at 1:56 PM, Matthew Taylor <[email protected]> wrote:
>
>> Looks like Chetan is using the nupic-texts project as a data source for
>> his "linguist" project.
>>
>>
>> http://lists.numenta.org/pipermail/nupic_lists.numenta.org/2013-August/001040.html
>> https://github.com/chetan51/linguist
>>
>> It uses a category encoder on each letter and predicts the next letter in
>> a sequence. To predict words in sequence instead of letters, I'm going to
>> see how easy it is to get word SDRs out of the CEPT API and feed them
>> into nupic. I'm not sure how much time I'll be able to spend on this, but
>> if I get anything worthwhile, I'll put it up on github.
>>
>> Has anyone done something like this with the CEPT API yet?
>>
>> ---------
>> Matt Taylor
>> OS Community Flag-Bearer
>> Numenta
>>
>>
>> On Thu, Aug 29, 2013 at 2:01 PM, Francisco Webber <[email protected]> wrote:
>>
>>> James,
>>> This looks great!
>>> Yes, the apostrophe tricked the parser …
>>> We could simply edit this in the source file and recompute the stats. In
>>> terms of punctuation, we should keep only the comma, full stop, question
>>> mark, and exclamation mark; semicolons should be changed into commas.
>>> Even if the other marks might not appear in these texts, it's always good
>>> to make the code fail-safe in this respect.
>>> Apostrophes and quotes are usually a mess. There must be something like
>>> 250 character codes in UTF-8 that produce characters that can behave like
>>> quotes…
>>> Best would be to replace them with blanks. Contracted words like haven't
>>> should be kept as one word, including the apostrophe.
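>>> Two of those rules could be sketched roughly like this (the function
>>> name is hypothetical): semicolons become commas, and quote-like
>>> characters are blanked out unless the apostrophe sits inside a word:

```python
# Sketch of two cleanup rules: map ; to , and blank out quote-like
# characters, except apostrophes inside a word (haven't stays one token).

APOSTROPHES = u"'\u2019"                  # straight and curly apostrophe
OTHER_QUOTES = u'"\u2018\u201c\u201d\u00ab\u00bb`'


def clean(text):
    text = text.replace(u";", u",")       # semicolon becomes comma
    out = []
    for i, ch in enumerate(text):
        in_word = (0 < i < len(text) - 1
                   and text[i - 1].isalpha() and text[i + 1].isalpha())
        if ch in APOSTROPHES and in_word:
            out.append(u"'")              # keep in-word apostrophes
        elif ch in APOSTROPHES or ch in OTHER_QUOTES:
            out.append(u" ")              # anything quote-like: blank it
        else:
            out.append(ch)
    return u"".join(out)


# haven't keeps its apostrophe; surrounding quotes are blanked out.
print(clean(u'He said, "we haven\u2019t left"; not yet!'))
```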
>>> I will work out the next steps over the weekend and post my progress on
>>> Monday.
>>> I also need to make some adaptations to the Retina for the CLA link-up.
>>> Texts 2 and 9 look like outliers… since they have few exclusive words,
>>> maybe we should use them just for tests on unseen text. I will give it
>>> another thought….
>>>
>>> Thanks for your support.
>>>
>>> Francisco
>>>
>>>
>>> On 29.08.2013, at 05:05, James Tauber wrote:
>>>
>>> I pushed a Python 3 script to my repo that does a bunch of calculations.
>>>
>>> Here are the results of that script. Let me know what you'd like to see
>>> next. I can already see one problem in the tokenization where 'No was
>>> not split (the leading quote stayed attached to the word).
>>>
>>> FILENAME                            BYTES TOKEN  TYPE
>>> -----------------------------------------------------
>>> 01_the_ugly_duckling.txt             3143   782   207
>>> 02_the_little_pine_tree.txt          1635   388   104
>>> 03_the_little_match_girl.txt         3065   701   218
>>> 04_little_red_riding_hood.txt        2168   509   159
>>> 05_the_apples_of_idun.txt            3923   934   244
>>> 06_how_thor_got_the_hammer.txt       5857  1373   318
>>> 07_the_hammer_lost_and_found.txt     4260  1010   258
>>> 08_the_story_of_the_sheep.txt        1265   304   129
>>> 09_the_good_ship_argo.txt             889   209   107
>>> 10_jason_and_the_harpies.txt         2187   495   173
>>> 11_the_brass_bulls.txt               3487   786   239
>>> 12_jason_and_the_dragon.txt          1867   427   180
>>> -----------------------------------------------------
>>> COLLECTION                          33746  7918   882
>>>
>>> Unique to 01_the_ugly_duckling.txt:
>>> {'spring', 'hid', 'summer', 'dears', 'lake', 'swans', 'own', 'eggs',
>>> 'lay', 'still', 'eating', 'pond', 'duckling', 'yard', 'Soon', 'egg', 'bug',
>>> 'cat', 'bushes', 'does', 'those', 'fun', 'winter', 'duck', 'Ugly',
>>> 'lovely', 'woman', 'hens', 'swim', 'While', 'swan', 'sang', 'nest',
>>> 'corner', 'bread', 'Splash', 'because', 'mother', 'growl', 'ducks', 'An',
>>> 'Let', 'noise', 'hen', 'ducklings', 'Only', 'Stay', 'Duckling'}
>>>
>>> Unique to 02_the_little_pine_tree.txt:
>>> {'Tree', 'broken', 'bag', 'Pine', 'needles', "'No", 'green', 'Night',
>>> 'pine', 'nor', 'glass', 'Again'}
>>>
>>> Unique to 03_the_little_match_girl.txt:
>>> {'dead', 'money', 'another', 'bunch', 'star', 'death', 'step',
>>> 'matches', 'O', 'papa', 'candle', 'Very', 'goes', "mama's", 'name',
>>> 'Match', 'cooking', 'smelled', 'falls', 'more', 'stars', 'frozen', 'stove',
>>> 'slippers', 'even', 'whip', 'froze', 'dying', 'running', 'curly', 'sweet',
>>> 'match', 'houses', 'knife', 'rags', 'sell', 'herself', 'pile', 'snow',
>>> 'lights', 'dish', 'buy', 'dishes', 'roast', 'Girl', 'apron', 'fork', 'Her',
>>> 'street', 'bare', 'God', 'cloth', 'windows', 'year', 'lot', 'heaven',
>>> 'Gretchen', 'room', 'colder', 'candles', 'Christmas'}
>>>
>>> Unique to 04_little_red_riding_hood.txt:
>>> {'live', 'tapped', 'string', 'Pull', 'dear', 'pick', 'cap', 'hunter',
>>> 'mill', 'Does', 'hug', 'open', 'voice', 'stopped', 'wood', "grandma's",
>>> 'Red', 'Thank', 'Look', 'butter', 'Hood', 'Mama', 'lady', 'soft', 'red',
>>> 'six', 'May', 'scream', 'ears', 'basket', 'Riding', 'hood', 'mama', 'wolf'}
>>>
>>> Unique to 05_the_apples_of_idun.txt:
>>> {'minute', 'walls', 'beautiful', "eagle's", 'pale', 'stuck', 'breath',
>>> 'Apples', 'stayed', 'pole', 'field', 'against', 'Idun', 'bumped', 'nut',
>>> 'share', 'talking', 'Day', 'feathers', 'supper', 'changed', 'story',
>>> 'apples', 'box', 'Those', 'ribs', 'cross', 'fast', 'eagle', 'blazed',
>>> 'Please', 'gate', 'Once', 'gates', 'end', 'liked', 'cook', 'enough',
>>> 'please', 'putting', 'meat', 'cattle', 'upon', 'journey', 'Bring', 'four'}
>>>
>>> Unique to 06_how_thor_got_the_hammer.txt:
>>> {'say', 'else', 'along', 'lying', 'such', 'ring', 'Sif', 'pocket',
>>> 'shining', 'pay', 'proud', 'than', "Brok's", 'mischief', 'dwarfs',
>>> "dwarfs'", 'miss', 'getting', 'misses', 'blood', 'stop', 'mark', 'Did',
>>> 'answer', 'same', "wife's", 'bellows', 'throw', 'dwarf', 'neck', 'Brok',
>>> 'Sindre', 'pig', 'beads', 'touch', 'touching', 'fold', 'pigskin',
>>> 'wonderful', 'hurried', 'Odin', 'spear', 'lump', 'crown', 'horse',
>>> 'showed', 'Each', 'forehead', 'crying', 'busy', 'blow', 'Pretty', 'backs',
>>> 'yet', 'working', 'crooked', 'nice', 'thumb', "Loki's", 'Their', 'burning',
>>> "Sif's", 'standing', 'brush', 'cutting', 'journeys', 'sorry', 'worked',
>>> 'brother', 'Blow', 'cannot', 'says', 'without', 'wait', 'Somebody',
>>> 'tricks', 'Got', 'blowing', 'spoiled', 'anywhere'}
>>>
>>> Unique to 07_the_hammer_lost_and_found.txt:
>>> {'while', "Giants'", 'taken', 'planned', 'laugh', 'everything', 'eight',
>>> "Freyja's", 'salmon', 'Get', 'brought', "bride's", 'drank', 'servant',
>>> 'Found', 'Giant', 'sing', 'lap', 'shook', 'lifted', 'Any', 'necklace',
>>> 'dogs', 'whole', "Giant's", 'Thrym', 'clothes', 'thirsty', 'eaten',
>>> 'barrels', 'dress', 'bite', 'comes', 'miles', 'kiss', 'Do', 'Put',
>>> "hasn't", 'makes', 'braided', 'Go', "Thrym's", 'Old', 'nights', 'Freyja',
>>> 'tore', 'play', 'floor', 'sit', "won't", 'collars', 'shone', 'others',
>>> 'deep', 'drink', 'dressed', 'shine', 'Lost', 'bride', 'vail', 'buried',
>>> 'Still', 'talked', 'mead', 'whirled', 'wagon'}
>>>
>>> Unique to 08_the_story_of_the_sheep.txt:
>>> {'bad', 'Long', 'sister', 'lose', 'catch', 'Hold', 'Story', 'Helle',
>>> 'ride', 'garden', 'sheep', 'played', 'boy', 'First', 'ago', 'nailed',
>>> 'Sheep', 'pat', 'clouds', 'loved', "sheep's", 'tame', 'dizzy', 'sky',
>>> 'Every', 'tight'}
>>>
>>> Unique to 09_the_good_ship_argo.txt:
>>> {'creek', 'Ship', 'wade', 'strings', 'rained', 'shoe', 'To', 'wild',
>>> 'bridge', 'party', 'invited'}
>>>
>>> Unique to 10_jason_and_the_harpies.txt:
>>> {'dove', 'friends', 'wings', 'apart', 'thanked', 'close', 'skin',
>>> 'drive', 'These', 'drowned', 'helping', 'bye', 'boat', 'past', 'scratched',
>>> 'hill', 'blind', 'Row', 'waterlike', 'moved', 'sailed', 'fishes',
>>> 'together', 'break', 'row', 'food', 'Harpies', 'On', 'icebergs'}
>>>
>>> Unique to 11_the_brass_bulls.txt:
>>> {'Bulls', 'knees', 'should', 'Rub', 'burn', 'princess', 'plant',
>>> 'pushed', 'planted', 'tied', 'face', 'slowly', 'seats', 'stronger', 'well',
>>> 'place', 'wheat', 'smoke', 'hold', 'chains', 'kicked', 'run', 'plow',
>>> 'Brass', "bulls'", 'marble', 'creeks', 'noses', 'snakes', 'mouths',
>>> 'sword', 'noon', 'plowed', 'plants', 'boys', 'stone', 'evening', 'stall',
>>> 'lie', 'heads', 'Early', 'larger', 'Nothing'}
>>>
>>> Unique to 12_jason_and_the_dragon.txt:
>>> {'eats', 'sleeps', 'became', 'father', 'mouth', 'yourself', 'died',
>>> 'nail', 'His', "Jason's", 'fond', 'ships', 'stick', 'cakes', 'nine',
>>> 'Dragon'}
>>>
>>>
>>> _______________________________________________
>>> nupic mailing list
>>> [email protected]
>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
>>>
>>>
>>
>>
>>
>
>
>
