Francisco was kind enough to run all the words in the stories through
CEPT's API and provide a file with the bit mappings:

http://numenta.org/resources/hackathon/Retina.txt

This way, you don't have to get a CEPT API key to work with the identified
texts if you'd prefer not to. I'll be adding a section to the wiki about
the NLP focus soon, which will provide guidance, list this and other
resources, and describe the currently available tools. We'll also be
identifying any gaps in our current tool set that need work before the
hackathon (like https://issues.numenta.org/browse/NPC-266), so we can try
to ensure hackers are productive.

---------
Matt Taylor
OS Community Flag-Bearer
Numenta


On Thu, Aug 29, 2013 at 2:01 PM, Francisco Webber <[email protected]> wrote:

> James,
> This looks great!
> Yes, the apostrophe tricked the parser …
> We could simply edit this in the source file and recompute the stats. In
> terms of punctuation, we should keep only the comma, full stop, question
> mark, and exclamation mark. Semicolons should be changed into commas.
> Even if the others might not appear in these texts, it's always good to
> make the code fail-safe in this regard.
> Apostrophes and quotes are usually a mess. There must be something like
> 250 character codes in UTF-8 that produce characters that can behave
> like quotes…
> It would be best to replace them with blanks. Words in contracted form,
> like haven't, should be taken as one word, including the apostrophe.
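[Editor's note: Francisco's punctuation rules above could be sketched roughly like this. This is a minimal illustration, not code from the thread; the exact character set CEPT's parser accepts is an assumption.]

```python
import re

def normalize(text: str) -> str:
    """Apply the punctuation rules from the email above:
    keep , . ? !  -- turn ; into ,  -- keep apostrophes only inside
    contractions (haven't) -- replace all other punctuation with blanks."""
    text = text.replace(";", ",")
    # protect apostrophes with a letter on both sides (contractions)
    text = re.sub(r"(?<=\w)['\u2019](?=\w)", "\x00", text)
    # every remaining punctuation-like character becomes a blank
    text = re.sub(r"[^\w\s,.?!\x00]", " ", text)
    return text.replace("\x00", "'")
```

Note that a stray leading apostrophe (as in the 'No token James spotted) is not letter-surrounded, so it becomes a blank and the word splits correctly.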
> I will work out the next steps over the weekend and post my progress
> on Monday.
> I also need to make some adaptations to the Retina for the CLA link-up.
> Texts 2 and 9 look like they fall out of line a bit… since they have few
> exclusive words, maybe we should use them just for unseen-text tests. I
> will give it another thought….
>
> Thanks for your support.
>
> Francisco
>
>
> On 29.08.2013, at 05:05, James Tauber wrote:
>
> I pushed a Python 3 script to my repo that does a bunch of calculations.
>
> Here are the results of that script. Let me know what you'd like to see
> next. I can already see one problem in the tokenization where 'No was not
> split.
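[Editor's note: the per-file statistics shown below (token and type counts, plus the words unique to each file) can be reproduced along these lines. The plain whitespace tokenization here is an assumption for illustration, not necessarily what James's script does.]

```python
from collections import Counter

def stats(texts):
    """texts: dict mapping filename -> raw text.
    Returns (counts, unique) where counts[name] = (tokens, types)
    and unique[name] = the set of words found only in that file."""
    words = {name: text.split() for name, text in texts.items()}
    counts = {name: (len(toks), len(set(toks)))
              for name, toks in words.items()}
    # a word is unique to a file if no other file contains it
    freq = Counter(w for toks in words.values() for w in set(toks))
    unique = {name: {w for w in set(toks) if freq[w] == 1}
              for name, toks in words.items()}
    return counts, unique
```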
>
> FILENAME                            BYTES TOKEN  TYPE
> -----------------------------------------------------
> 01_the_ugly_duckling.txt             3143   782   207
> 02_the_little_pine_tree.txt          1635   388   104
> 03_the_little_match_girl.txt         3065   701   218
> 04_little_red_riding_hood.txt        2168   509   159
> 05_the_apples_of_idun.txt            3923   934   244
> 06_how_thor_got_the_hammer.txt       5857  1373   318
> 07_the_hammer_lost_and_found.txt     4260  1010   258
> 08_the_story_of_the_sheep.txt        1265   304   129
> 09_the_good_ship_argo.txt             889   209   107
> 10_jason_and_the_harpies.txt         2187   495   173
> 11_the_brass_bulls.txt               3487   786   239
> 12_jason_and_the_dragon.txt          1867   427   180
> -----------------------------------------------------
> COLLECTION                          33746  7918   882
>
> Unique to 01_the_ugly_duckling.txt:
> {'spring', 'hid', 'summer', 'dears', 'lake', 'swans', 'own', 'eggs',
> 'lay', 'still', 'eating', 'pond', 'duckling', 'yard', 'Soon', 'egg', 'bug',
> 'cat', 'bushes', 'does', 'those', 'fun', 'winter', 'duck', 'Ugly',
> 'lovely', 'woman', 'hens', 'swim', 'While', 'swan', 'sang', 'nest',
> 'corner', 'bread', 'Splash', 'because', 'mother', 'growl', 'ducks', 'An',
> 'Let', 'noise', 'hen', 'ducklings', 'Only', 'Stay', 'Duckling'}
>
> Unique to 02_the_little_pine_tree.txt:
> {'Tree', 'broken', 'bag', 'Pine', 'needles', "'No", 'green', 'Night',
> 'pine', 'nor', 'glass', 'Again'}
>
> Unique to 03_the_little_match_girl.txt:
> {'dead', 'money', 'another', 'bunch', 'star', 'death', 'step', 'matches',
> 'O', 'papa', 'candle', 'Very', 'goes', "mama's", 'name', 'Match',
> 'cooking', 'smelled', 'falls', 'more', 'stars', 'frozen', 'stove',
> 'slippers', 'even', 'whip', 'froze', 'dying', 'running', 'curly', 'sweet',
> 'match', 'houses', 'knife', 'rags', 'sell', 'herself', 'pile', 'snow',
> 'lights', 'dish', 'buy', 'dishes', 'roast', 'Girl', 'apron', 'fork', 'Her',
> 'street', 'bare', 'God', 'cloth', 'windows', 'year', 'lot', 'heaven',
> 'Gretchen', 'room', 'colder', 'candles', 'Christmas'}
>
> Unique to 04_little_red_riding_hood.txt:
> {'live', 'tapped', 'string', 'Pull', 'dear', 'pick', 'cap', 'hunter',
> 'mill', 'Does', 'hug', 'open', 'voice', 'stopped', 'wood', "grandma's",
> 'Red', 'Thank', 'Look', 'butter', 'Hood', 'Mama', 'lady', 'soft', 'red',
> 'six', 'May', 'scream', 'ears', 'basket', 'Riding', 'hood', 'mama', 'wolf'}
>
> Unique to 05_the_apples_of_idun.txt:
> {'minute', 'walls', 'beautiful', "eagle's", 'pale', 'stuck', 'breath',
> 'Apples', 'stayed', 'pole', 'field', 'against', 'Idun', 'bumped', 'nut',
> 'share', 'talking', 'Day', 'feathers', 'supper', 'changed', 'story',
> 'apples', 'box', 'Those', 'ribs', 'cross', 'fast', 'eagle', 'blazed',
> 'Please', 'gate', 'Once', 'gates', 'end', 'liked', 'cook', 'enough',
> 'please', 'putting', 'meat', 'cattle', 'upon', 'journey', 'Bring', 'four'}
>
> Unique to 06_how_thor_got_the_hammer.txt:
> {'say', 'else', 'along', 'lying', 'such', 'ring', 'Sif', 'pocket',
> 'shining', 'pay', 'proud', 'than', "Brok's", 'mischief', 'dwarfs',
> "dwarfs'", 'miss', 'getting', 'misses', 'blood', 'stop', 'mark', 'Did',
> 'answer', 'same', "wife's", 'bellows', 'throw', 'dwarf', 'neck', 'Brok',
> 'Sindre', 'pig', 'beads', 'touch', 'touching', 'fold', 'pigskin',
> 'wonderful', 'hurried', 'Odin', 'spear', 'lump', 'crown', 'horse',
> 'showed', 'Each', 'forehead', 'crying', 'busy', 'blow', 'Pretty', 'backs',
> 'yet', 'working', 'crooked', 'nice', 'thumb', "Loki's", 'Their', 'burning',
> "Sif's", 'standing', 'brush', 'cutting', 'journeys', 'sorry', 'worked',
> 'brother', 'Blow', 'cannot', 'says', 'without', 'wait', 'Somebody',
> 'tricks', 'Got', 'blowing', 'spoiled', 'anywhere'}
>
> Unique to 07_the_hammer_lost_and_found.txt:
> {'while', "Giants'", 'taken', 'planned', 'laugh', 'everything', 'eight',
> "Freyja's", 'salmon', 'Get', 'brought', "bride's", 'drank', 'servant',
> 'Found', 'Giant', 'sing', 'lap', 'shook', 'lifted', 'Any', 'necklace',
> 'dogs', 'whole', "Giant's", 'Thrym', 'clothes', 'thirsty', 'eaten',
> 'barrels', 'dress', 'bite', 'comes', 'miles', 'kiss', 'Do', 'Put',
> "hasn't", 'makes', 'braided', 'Go', "Thrym's", 'Old', 'nights', 'Freyja',
> 'tore', 'play', 'floor', 'sit', "won't", 'collars', 'shone', 'others',
> 'deep', 'drink', 'dressed', 'shine', 'Lost', 'bride', 'vail', 'buried',
> 'Still', 'talked', 'mead', 'whirled', 'wagon'}
>
> Unique to 08_the_story_of_the_sheep.txt:
> {'bad', 'Long', 'sister', 'lose', 'catch', 'Hold', 'Story', 'Helle',
> 'ride', 'garden', 'sheep', 'played', 'boy', 'First', 'ago', 'nailed',
> 'Sheep', 'pat', 'clouds', 'loved', "sheep's", 'tame', 'dizzy', 'sky',
> 'Every', 'tight'}
>
> Unique to 09_the_good_ship_argo.txt:
> {'creek', 'Ship', 'wade', 'strings', 'rained', 'shoe', 'To', 'wild',
> 'bridge', 'party', 'invited'}
>
> Unique to 10_jason_and_the_harpies.txt:
> {'dove', 'friends', 'wings', 'apart', 'thanked', 'close', 'skin', 'drive',
> 'These', 'drowned', 'helping', 'bye', 'boat', 'past', 'scratched', 'hill',
> 'blind', 'Row', 'waterlike', 'moved', 'sailed', 'fishes', 'together',
> 'break', 'row', 'food', 'Harpies', 'On', 'icebergs'}
>
> Unique to 11_the_brass_bulls.txt:
> {'Bulls', 'knees', 'should', 'Rub', 'burn', 'princess', 'plant', 'pushed',
> 'planted', 'tied', 'face', 'slowly', 'seats', 'stronger', 'well', 'place',
> 'wheat', 'smoke', 'hold', 'chains', 'kicked', 'run', 'plow', 'Brass',
> "bulls'", 'marble', 'creeks', 'noses', 'snakes', 'mouths', 'sword', 'noon',
> 'plowed', 'plants', 'boys', 'stone', 'evening', 'stall', 'lie', 'heads',
> 'Early', 'larger', 'Nothing'}
>
> Unique to 12_jason_and_the_dragon.txt:
> {'eats', 'sleeps', 'became', 'father', 'mouth', 'yourself', 'died',
> 'nail', 'His', "Jason's", 'fond', 'ships', 'stick', 'cakes', 'nine',
> 'Dragon'}
>
>
>  _______________________________________________
> nupic mailing list
> [email protected]
> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org