James,
This looks great!
Yes, the apostrophe tricked the parser…
We could simply fix this in the source file and recompute the stats. In terms
of punctuation, we should keep only the comma, full stop, question mark and
exclamation mark. Semicolons should be changed into commas.
Even if they might not appear in these texts, it's always good to make the code
fail-safe in this respect.
Apostrophes and quotes are usually a mess. There must be something like 250
character codes in Unicode that produce some character that can behave like a
quote…
The best approach would be to replace them with blanks. Contracted words
like haven't should be treated as one word, including the apostrophe.
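The cleanup rules above could be sketched roughly as follows. This is only a minimal sketch, assuming Python 3 like James's script; the names TOKEN_RE and tokenize are hypothetical and not taken from his repo. It treats every character that is not a word character or one of the four kept punctuation marks as a blank, which covers quote-like characters without having to list them all.

```python
import re

# A word may contain internal apostrophes (haven't, mama's). The four kept
# punctuation marks come out as separate tokens. Everything else, including
# every quote-like character, acts as a separator (i.e. a blank).
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[,.?!]")


def tokenize(text):
    """Split text into words and the kept punctuation marks."""
    # Assumption: the typographic apostrophe should count as an apostrophe too.
    text = text.replace("\u2019", "'")
    # Semicolons are changed into commas.
    text = text.replace(";", ",")
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("'No, I haven't; she said!"))
```

With this, a leading quote as in 'No is dropped and the word comes out as No, while haven't stays one token. A trailing apostrophe, as in the possessive dwarfs', is also dropped; whether that is wanted is an open choice.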
I will work out the next steps over the weekend and post my results on
Monday.
I also need to make some adaptations to the Retina for the CLA link-up.
It looks like texts 2 and 9 fall out of line a bit… maybe we should use them just
for unseen-text tests, as they have few exclusive words. I will give it
another thought…
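One way to check which texts have few exclusive words is sketched below. It is only a sketch: the function names and the idea of ranking by the share of exclusive types are assumptions for illustration, not something James's script is known to do. The input is assumed to be a dict mapping each file name to its set of word types.

```python
from collections import Counter


def exclusive_types(types_by_text):
    """Map each text name to the set of types that occur in no other text."""
    doc_freq = Counter()
    for types in types_by_text.values():
        doc_freq.update(set(types))  # count each type once per text
    return {
        name: {t for t in types if doc_freq[t] == 1}
        for name, types in types_by_text.items()
    }


def rank_by_exclusive_share(types_by_text):
    """Return (name, share) pairs, lowest share of exclusive types first."""
    exclusive = exclusive_types(types_by_text)
    shares = {
        name: len(exclusive[name]) / len(types)
        for name, types in types_by_text.items()
        if types  # skip empty texts to avoid dividing by zero
    }
    return sorted(shares.items(), key=lambda item: item[1])


if __name__ == "__main__":
    toy = {"a": {"x", "y", "z"}, "b": {"x", "w"}, "c": {"x", "y"}}
    print(rank_by_exclusive_share(toy))
```

Texts at the top of the ranking share most of their vocabulary with the rest of the collection, which would make them candidates for the unseen-text tests.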
Thanks for your support.
Francisco
On 29.08.2013, at 05:05, James Tauber wrote:
> I pushed a Python 3 script to my repo that does a bunch of calculations.
>
> Here are the results of that script. Let me know what you'd like to see next.
> I can already see one problem in the tokenization where 'No was not split.
>
> FILENAME                          BYTES  TOKENS  TYPES
> ------------------------------------------------------
> 01_the_ugly_duckling.txt           3143     782    207
> 02_the_little_pine_tree.txt        1635     388    104
> 03_the_little_match_girl.txt       3065     701    218
> 04_little_red_riding_hood.txt      2168     509    159
> 05_the_apples_of_idun.txt          3923     934    244
> 06_how_thor_got_the_hammer.txt     5857    1373    318
> 07_the_hammer_lost_and_found.txt   4260    1010    258
> 08_the_story_of_the_sheep.txt      1265     304    129
> 09_the_good_ship_argo.txt           889     209    107
> 10_jason_and_the_harpies.txt       2187     495    173
> 11_the_brass_bulls.txt             3487     786    239
> 12_jason_and_the_dragon.txt        1867     427    180
> ------------------------------------------------------
> COLLECTION                        33746    7918    882
>
> Unique to 01_the_ugly_duckling.txt:
> {'spring', 'hid', 'summer', 'dears', 'lake', 'swans', 'own', 'eggs', 'lay',
> 'still', 'eating', 'pond', 'duckling', 'yard', 'Soon', 'egg', 'bug', 'cat',
> 'bushes', 'does', 'those', 'fun', 'winter', 'duck', 'Ugly', 'lovely',
> 'woman', 'hens', 'swim', 'While', 'swan', 'sang', 'nest', 'corner', 'bread',
> 'Splash', 'because', 'mother', 'growl', 'ducks', 'An', 'Let', 'noise', 'hen',
> 'ducklings', 'Only', 'Stay', 'Duckling'}
>
> Unique to 02_the_little_pine_tree.txt:
> {'Tree', 'broken', 'bag', 'Pine', 'needles', "'No", 'green', 'Night', 'pine',
> 'nor', 'glass', 'Again'}
>
> Unique to 03_the_little_match_girl.txt:
> {'dead', 'money', 'another', 'bunch', 'star', 'death', 'step', 'matches',
> 'O', 'papa', 'candle', 'Very', 'goes', "mama's", 'name', 'Match', 'cooking',
> 'smelled', 'falls', 'more', 'stars', 'frozen', 'stove', 'slippers', 'even',
> 'whip', 'froze', 'dying', 'running', 'curly', 'sweet', 'match', 'houses',
> 'knife', 'rags', 'sell', 'herself', 'pile', 'snow', 'lights', 'dish', 'buy',
> 'dishes', 'roast', 'Girl', 'apron', 'fork', 'Her', 'street', 'bare', 'God',
> 'cloth', 'windows', 'year', 'lot', 'heaven', 'Gretchen', 'room', 'colder',
> 'candles', 'Christmas'}
>
> Unique to 04_little_red_riding_hood.txt:
> {'live', 'tapped', 'string', 'Pull', 'dear', 'pick', 'cap', 'hunter', 'mill',
> 'Does', 'hug', 'open', 'voice', 'stopped', 'wood', "grandma's", 'Red',
> 'Thank', 'Look', 'butter', 'Hood', 'Mama', 'lady', 'soft', 'red', 'six',
> 'May', 'scream', 'ears', 'basket', 'Riding', 'hood', 'mama', 'wolf'}
>
> Unique to 05_the_apples_of_idun.txt:
> {'minute', 'walls', 'beautiful', "eagle's", 'pale', 'stuck', 'breath',
> 'Apples', 'stayed', 'pole', 'field', 'against', 'Idun', 'bumped', 'nut',
> 'share', 'talking', 'Day', 'feathers', 'supper', 'changed', 'story',
> 'apples', 'box', 'Those', 'ribs', 'cross', 'fast', 'eagle', 'blazed',
> 'Please', 'gate', 'Once', 'gates', 'end', 'liked', 'cook', 'enough',
> 'please', 'putting', 'meat', 'cattle', 'upon', 'journey', 'Bring', 'four'}
>
> Unique to 06_how_thor_got_the_hammer.txt:
> {'say', 'else', 'along', 'lying', 'such', 'ring', 'Sif', 'pocket', 'shining',
> 'pay', 'proud', 'than', "Brok's", 'mischief', 'dwarfs', "dwarfs'", 'miss',
> 'getting', 'misses', 'blood', 'stop', 'mark', 'Did', 'answer', 'same',
> "wife's", 'bellows', 'throw', 'dwarf', 'neck', 'Brok', 'Sindre', 'pig',
> 'beads', 'touch', 'touching', 'fold', 'pigskin', 'wonderful', 'hurried',
> 'Odin', 'spear', 'lump', 'crown', 'horse', 'showed', 'Each', 'forehead',
> 'crying', 'busy', 'blow', 'Pretty', 'backs', 'yet', 'working', 'crooked',
> 'nice', 'thumb', "Loki's", 'Their', 'burning', "Sif's", 'standing', 'brush',
> 'cutting', 'journeys', 'sorry', 'worked', 'brother', 'Blow', 'cannot',
> 'says', 'without', 'wait', 'Somebody', 'tricks', 'Got', 'blowing', 'spoiled',
> 'anywhere'}
>
> Unique to 07_the_hammer_lost_and_found.txt:
> {'while', "Giants'", 'taken', 'planned', 'laugh', 'everything', 'eight',
> "Freyja's", 'salmon', 'Get', 'brought', "bride's", 'drank', 'servant',
> 'Found', 'Giant', 'sing', 'lap', 'shook', 'lifted', 'Any', 'necklace',
> 'dogs', 'whole', "Giant's", 'Thrym', 'clothes', 'thirsty', 'eaten',
> 'barrels', 'dress', 'bite', 'comes', 'miles', 'kiss', 'Do', 'Put', "hasn't",
> 'makes', 'braided', 'Go', "Thrym's", 'Old', 'nights', 'Freyja', 'tore',
> 'play', 'floor', 'sit', "won't", 'collars', 'shone', 'others', 'deep',
> 'drink', 'dressed', 'shine', 'Lost', 'bride', 'vail', 'buried', 'Still',
> 'talked', 'mead', 'whirled', 'wagon'}
>
> Unique to 08_the_story_of_the_sheep.txt:
> {'bad', 'Long', 'sister', 'lose', 'catch', 'Hold', 'Story', 'Helle', 'ride',
> 'garden', 'sheep', 'played', 'boy', 'First', 'ago', 'nailed', 'Sheep', 'pat',
> 'clouds', 'loved', "sheep's", 'tame', 'dizzy', 'sky', 'Every', 'tight'}
>
> Unique to 09_the_good_ship_argo.txt:
> {'creek', 'Ship', 'wade', 'strings', 'rained', 'shoe', 'To', 'wild',
> 'bridge', 'party', 'invited'}
>
> Unique to 10_jason_and_the_harpies.txt:
> {'dove', 'friends', 'wings', 'apart', 'thanked', 'close', 'skin', 'drive',
> 'These', 'drowned', 'helping', 'bye', 'boat', 'past', 'scratched', 'hill',
> 'blind', 'Row', 'waterlike', 'moved', 'sailed', 'fishes', 'together',
> 'break', 'row', 'food', 'Harpies', 'On', 'icebergs'}
>
> Unique to 11_the_brass_bulls.txt:
> {'Bulls', 'knees', 'should', 'Rub', 'burn', 'princess', 'plant', 'pushed',
> 'planted', 'tied', 'face', 'slowly', 'seats', 'stronger', 'well', 'place',
> 'wheat', 'smoke', 'hold', 'chains', 'kicked', 'run', 'plow', 'Brass',
> "bulls'", 'marble', 'creeks', 'noses', 'snakes', 'mouths', 'sword', 'noon',
> 'plowed', 'plants', 'boys', 'stone', 'evening', 'stall', 'lie', 'heads',
> 'Early', 'larger', 'Nothing'}
>
> Unique to 12_jason_and_the_dragon.txt:
> {'eats', 'sleeps', 'became', 'father', 'mouth', 'yourself', 'died', 'nail',
> 'His', "Jason's", 'fond', 'ships', 'stick', 'cakes', 'nine', 'Dragon'}
>
>
> _______________________________________________
> nupic mailing list
> [email protected]
> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org