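I pushed a Python 3 script to my repo that computes byte, token, and type
counts for each story, plus the words that are unique to each story. The
results are below. Let me know what you'd like to see next. I can already
see one tokenization problem: 'No (in 02_the_little_pine_tree.txt) kept its
leading quote instead of being split from it.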
FILENAME                          BYTES  TOKEN  TYPE
----------------------------------------------------
01_the_ugly_duckling.txt           3143    782   207
02_the_little_pine_tree.txt        1635    388   104
03_the_little_match_girl.txt       3065    701   218
04_little_red_riding_hood.txt      2168    509   159
05_the_apples_of_idun.txt          3923    934   244
06_how_thor_got_the_hammer.txt     5857   1373   318
07_the_hammer_lost_and_found.txt   4260   1010   258
08_the_story_of_the_sheep.txt      1265    304   129
09_the_good_ship_argo.txt           889    209   107
10_jason_and_the_harpies.txt       2187    495   173
11_the_brass_bulls.txt             3487    786   239
12_jason_and_the_dragon.txt        1867    427   180
----------------------------------------------------
COLLECTION                        33746   7918   882
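
For anyone who wants to check the table without pulling the repo, here is a
minimal sketch of how the three columns can be computed. It is not the
script itself: the tokenize() rule (runs of letters with optional internal
apostrophes) and the function names count_stats() and collection_stats()
are assumptions made for illustration. Types are counted case-sensitively,
which matches the output above (for example 'duckling' and 'Duckling' are
listed separately). Note that the COLLECTION type count is the number of
distinct tokens across all files, so it is not the sum of the TYPE column.

```python
import re

def tokenize(text):
    # Assumed rule, not the real script's: runs of letters,
    # optionally joined by internal apostrophes (e.g. "mama's").
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)

def count_stats(raw):
    """Return (bytes, tokens, types) for one file's raw bytes."""
    tokens = tokenize(raw.decode("utf-8"))
    return len(raw), len(tokens), len(set(tokens))

def collection_stats(raws):
    """Return (bytes, tokens, types) over all files taken together."""
    total_bytes = sum(len(raw) for raw in raws)
    all_tokens = [t for raw in raws for t in tokenize(raw.decode("utf-8"))]
    # Types are distinct tokens across the whole collection.
    return total_bytes, len(all_tokens), len(set(all_tokens))

print(count_stats(b"The duck swam. The duck sang."))  # (29, 6, 4)
print(collection_stats([b"a b", b"b c"]))             # (6, 4, 3)
```
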
Unique to 01_the_ugly_duckling.txt:
{'spring', 'hid', 'summer', 'dears', 'lake', 'swans', 'own', 'eggs', 'lay',
'still', 'eating', 'pond', 'duckling', 'yard', 'Soon', 'egg', 'bug', 'cat',
'bushes', 'does', 'those', 'fun', 'winter', 'duck', 'Ugly', 'lovely',
'woman', 'hens', 'swim', 'While', 'swan', 'sang', 'nest', 'corner',
'bread', 'Splash', 'because', 'mother', 'growl', 'ducks', 'An', 'Let',
'noise', 'hen', 'ducklings', 'Only', 'Stay', 'Duckling'}
Unique to 02_the_little_pine_tree.txt:
{'Tree', 'broken', 'bag', 'Pine', 'needles', "'No", 'green', 'Night',
'pine', 'nor', 'glass', 'Again'}
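
The "'No" entry above is the tokenization problem: the opening quote stayed
attached to the word. One possible fix, sketched under assumptions: the
RAW_TOKEN pattern and tokenize() below are hypothetical and are not taken
from the script. The idea is to strip a leading apostrophe from each token
while keeping internal ones ("mama's") and trailing ones ("dwarfs'"), since
a trailing apostrophe may be a plural possessive rather than a closing
quote.

```python
import re

# Hypothetical token pattern: letters with optional apostrophes
# before, inside, and after the word.
RAW_TOKEN = re.compile(r"'?[A-Za-z]+(?:'[A-Za-z]+)*'?")

def tokenize(text):
    tokens = []
    for match in RAW_TOKEN.finditer(text):
        # A leading apostrophe is treated as an opening quote and dropped.
        tokens.append(match.group().lstrip("'"))
    return tokens

print(tokenize("He said, 'No, not mama's dwarfs' gold.'"))
# ['He', 'said', 'No', 'not', "mama's", "dwarfs'", 'gold']
```
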
Unique to 03_the_little_match_girl.txt:
{'dead', 'money', 'another', 'bunch', 'star', 'death', 'step', 'matches',
'O', 'papa', 'candle', 'Very', 'goes', "mama's", 'name', 'Match',
'cooking', 'smelled', 'falls', 'more', 'stars', 'frozen', 'stove',
'slippers', 'even', 'whip', 'froze', 'dying', 'running', 'curly', 'sweet',
'match', 'houses', 'knife', 'rags', 'sell', 'herself', 'pile', 'snow',
'lights', 'dish', 'buy', 'dishes', 'roast', 'Girl', 'apron', 'fork', 'Her',
'street', 'bare', 'God', 'cloth', 'windows', 'year', 'lot', 'heaven',
'Gretchen', 'room', 'colder', 'candles', 'Christmas'}
Unique to 04_little_red_riding_hood.txt:
{'live', 'tapped', 'string', 'Pull', 'dear', 'pick', 'cap', 'hunter',
'mill', 'Does', 'hug', 'open', 'voice', 'stopped', 'wood', "grandma's",
'Red', 'Thank', 'Look', 'butter', 'Hood', 'Mama', 'lady', 'soft', 'red',
'six', 'May', 'scream', 'ears', 'basket', 'Riding', 'hood', 'mama', 'wolf'}
Unique to 05_the_apples_of_idun.txt:
{'minute', 'walls', 'beautiful', "eagle's", 'pale', 'stuck', 'breath',
'Apples', 'stayed', 'pole', 'field', 'against', 'Idun', 'bumped', 'nut',
'share', 'talking', 'Day', 'feathers', 'supper', 'changed', 'story',
'apples', 'box', 'Those', 'ribs', 'cross', 'fast', 'eagle', 'blazed',
'Please', 'gate', 'Once', 'gates', 'end', 'liked', 'cook', 'enough',
'please', 'putting', 'meat', 'cattle', 'upon', 'journey', 'Bring', 'four'}
Unique to 06_how_thor_got_the_hammer.txt:
{'say', 'else', 'along', 'lying', 'such', 'ring', 'Sif', 'pocket',
'shining', 'pay', 'proud', 'than', "Brok's", 'mischief', 'dwarfs',
"dwarfs'", 'miss', 'getting', 'misses', 'blood', 'stop', 'mark', 'Did',
'answer', 'same', "wife's", 'bellows', 'throw', 'dwarf', 'neck', 'Brok',
'Sindre', 'pig', 'beads', 'touch', 'touching', 'fold', 'pigskin',
'wonderful', 'hurried', 'Odin', 'spear', 'lump', 'crown', 'horse',
'showed', 'Each', 'forehead', 'crying', 'busy', 'blow', 'Pretty', 'backs',
'yet', 'working', 'crooked', 'nice', 'thumb', "Loki's", 'Their', 'burning',
"Sif's", 'standing', 'brush', 'cutting', 'journeys', 'sorry', 'worked',
'brother', 'Blow', 'cannot', 'says', 'without', 'wait', 'Somebody',
'tricks', 'Got', 'blowing', 'spoiled', 'anywhere'}
Unique to 07_the_hammer_lost_and_found.txt:
{'while', "Giants'", 'taken', 'planned', 'laugh', 'everything', 'eight',
"Freyja's", 'salmon', 'Get', 'brought', "bride's", 'drank', 'servant',
'Found', 'Giant', 'sing', 'lap', 'shook', 'lifted', 'Any', 'necklace',
'dogs', 'whole', "Giant's", 'Thrym', 'clothes', 'thirsty', 'eaten',
'barrels', 'dress', 'bite', 'comes', 'miles', 'kiss', 'Do', 'Put',
"hasn't", 'makes', 'braided', 'Go', "Thrym's", 'Old', 'nights', 'Freyja',
'tore', 'play', 'floor', 'sit', "won't", 'collars', 'shone', 'others',
'deep', 'drink', 'dressed', 'shine', 'Lost', 'bride', 'vail', 'buried',
'Still', 'talked', 'mead', 'whirled', 'wagon'}
Unique to 08_the_story_of_the_sheep.txt:
{'bad', 'Long', 'sister', 'lose', 'catch', 'Hold', 'Story', 'Helle',
'ride', 'garden', 'sheep', 'played', 'boy', 'First', 'ago', 'nailed',
'Sheep', 'pat', 'clouds', 'loved', "sheep's", 'tame', 'dizzy', 'sky',
'Every', 'tight'}
Unique to 09_the_good_ship_argo.txt:
{'creek', 'Ship', 'wade', 'strings', 'rained', 'shoe', 'To', 'wild',
'bridge', 'party', 'invited'}
Unique to 10_jason_and_the_harpies.txt:
{'dove', 'friends', 'wings', 'apart', 'thanked', 'close', 'skin', 'drive',
'These', 'drowned', 'helping', 'bye', 'boat', 'past', 'scratched', 'hill',
'blind', 'Row', 'waterlike', 'moved', 'sailed', 'fishes', 'together',
'break', 'row', 'food', 'Harpies', 'On', 'icebergs'}
Unique to 11_the_brass_bulls.txt:
{'Bulls', 'knees', 'should', 'Rub', 'burn', 'princess', 'plant', 'pushed',
'planted', 'tied', 'face', 'slowly', 'seats', 'stronger', 'well', 'place',
'wheat', 'smoke', 'hold', 'chains', 'kicked', 'run', 'plow', 'Brass',
"bulls'", 'marble', 'creeks', 'noses', 'snakes', 'mouths', 'sword', 'noon',
'plowed', 'plants', 'boys', 'stone', 'evening', 'stall', 'lie', 'heads',
'Early', 'larger', 'Nothing'}
Unique to 12_jason_and_the_dragon.txt:
{'eats', 'sleeps', 'became', 'father', 'mouth', 'yourself', 'died', 'nail',
'His', "Jason's", 'fond', 'ships', 'stick', 'cakes', 'nine', 'Dragon'}
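
The "Unique to" lists above are set differences: the types of one file
minus the types of every other file. A minimal sketch of that step follows.
The function name unique_types() and the toy file names are made up for
illustration; the real script may organize this differently.

```python
def unique_types(types_by_file):
    """Map each filename to the types found in that file and no other."""
    unique = {}
    for name, types in types_by_file.items():
        # Union of the types of all the other files.
        others = set().union(
            *(t for other, t in types_by_file.items() if other != name)
        )
        unique[name] = types - others
    return unique

example = {
    "a.txt": {"the", "duck", "pond"},
    "b.txt": {"the", "pine"},
    "c.txt": {"pond", "wolf"},
}
result = unique_types(example)
for name in sorted(result):
    print("Unique to %s:" % name, result[name])
```

Here "the" and "pond" each occur in two files, so only "duck", "pine", and
"wolf" are reported as unique.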