Francisco was kind enough to run all the words in the stories through CEPT's API and provide a file with the bit mappings:
http://numenta.org/resources/hackathon/Retina.txt

This way you don't have to get a CEPT API key to work with the identified
texts if you'd prefer not to.

I'll be adding a section to the wiki about the NLP focus soon, which will
provide guidance, list this and other resources, and describe the currently
available tools. We'll also be identifying any gaps in our current tool set
that need work before the hackathon (like
https://issues.numenta.org/browse/NPC-266), so we can try to ensure hackers
are productive.

---------
Matt Taylor
OS Community Flag-Bearer
Numenta

On Thu, Aug 29, 2013 at 2:01 PM, Francisco Webber <[email protected]> wrote:

> James,
> This looks great!
> Yes the apostrophe tricked the parser …
> We could simply edit this in the source file and recompute the stats. In
> terms of punctuation we should just keep comma, full stop, question mark,
> exclamation mark. Semicolon should be changed into comma.
> Even if they might not appear in these texts it's always good to make the
> code fail-safe in this respect.
> Apostrophes and quotes are usually a mess. There must be something like
> 250 character codes in UTF-8 that produce some character that can behave
> like quotes…
> Best would be to replace them with blanks. Words that are in the reduced
> form like haven't should be taken as one word including the apostrophe.
> I will work out the next steps over the weekend and post my achievements
> on Monday.
> I also need to make some adaptations on the Retina for the CLA link-up.
> Looks like text 2 and 9 drop out the line a bit… maybe we should use them
> just for doing unseen text tests, as they have few exclusive words. I will
> give it another thought….
>
> Thanks for your support.
>
> Francisco
>
>
> On 29.08.2013, at 05:05, James Tauber wrote:
>
> I pushed a Python 3 script to my repo that does a bunch of calculations.
>
> Here are the results of that script. Let me know what you'd like to see
> next.
> I can already see one problem in the tokenization where 'No was not
> split.
>
> FILENAME                          BYTES  TOKEN   TYPE
> -----------------------------------------------------
> 01_the_ugly_duckling.txt           3143    782    207
> 02_the_little_pine_tree.txt        1635    388    104
> 03_the_little_match_girl.txt       3065    701    218
> 04_little_red_riding_hood.txt      2168    509    159
> 05_the_apples_of_idun.txt          3923    934    244
> 06_how_thor_got_the_hammer.txt     5857   1373    318
> 07_the_hammer_lost_and_found.txt   4260   1010    258
> 08_the_story_of_the_sheep.txt      1265    304    129
> 09_the_good_ship_argo.txt           889    209    107
> 10_jason_and_the_harpies.txt       2187    495    173
> 11_the_brass_bulls.txt             3487    786    239
> 12_jason_and_the_dragon.txt        1867    427    180
> -----------------------------------------------------
> COLLECTION                        33746   7918    882
>
> Unique to 01_the_ugly_duckling.txt:
> {'spring', 'hid', 'summer', 'dears', 'lake', 'swans', 'own', 'eggs',
> 'lay', 'still', 'eating', 'pond', 'duckling', 'yard', 'Soon', 'egg', 'bug',
> 'cat', 'bushes', 'does', 'those', 'fun', 'winter', 'duck', 'Ugly',
> 'lovely', 'woman', 'hens', 'swim', 'While', 'swan', 'sang', 'nest',
> 'corner', 'bread', 'Splash', 'because', 'mother', 'growl', 'ducks', 'An',
> 'Let', 'noise', 'hen', 'ducklings', 'Only', 'Stay', 'Duckling'}
>
> Unique to 02_the_little_pine_tree.txt:
> {'Tree', 'broken', 'bag', 'Pine', 'needles', "'No", 'green', 'Night',
> 'pine', 'nor', 'glass', 'Again'}
>
> Unique to 03_the_little_match_girl.txt:
> {'dead', 'money', 'another', 'bunch', 'star', 'death', 'step', 'matches',
> 'O', 'papa', 'candle', 'Very', 'goes', "mama's", 'name', 'Match',
> 'cooking', 'smelled', 'falls', 'more', 'stars', 'frozen', 'stove',
> 'slippers', 'even', 'whip', 'froze', 'dying', 'running', 'curly', 'sweet',
> 'match', 'houses', 'knife', 'rags', 'sell', 'herself', 'pile', 'snow',
> 'lights', 'dish', 'buy', 'dishes', 'roast', 'Girl', 'apron', 'fork', 'Her',
> 'street', 'bare', 'God', 'cloth', 'windows', 'year', 'lot', 'heaven',
> 'Gretchen', 'room', 'colder', 'candles', 'Christmas'}
>
> Unique to 04_little_red_riding_hood.txt:
> {'live', 'tapped', 'string', 'Pull', 'dear', 'pick', 'cap', 'hunter',
> 'mill', 'Does', 'hug', 'open', 'voice', 'stopped', 'wood', "grandma's",
> 'Red', 'Thank', 'Look', 'butter', 'Hood', 'Mama', 'lady', 'soft', 'red',
> 'six', 'May', 'scream', 'ears', 'basket', 'Riding', 'hood', 'mama', 'wolf'}
>
> Unique to 05_the_apples_of_idun.txt:
> {'minute', 'walls', 'beautiful', "eagle's", 'pale', 'stuck', 'breath',
> 'Apples', 'stayed', 'pole', 'field', 'against', 'Idun', 'bumped', 'nut',
> 'share', 'talking', 'Day', 'feathers', 'supper', 'changed', 'story',
> 'apples', 'box', 'Those', 'ribs', 'cross', 'fast', 'eagle', 'blazed',
> 'Please', 'gate', 'Once', 'gates', 'end', 'liked', 'cook', 'enough',
> 'please', 'putting', 'meat', 'cattle', 'upon', 'journey', 'Bring', 'four'}
>
> Unique to 06_how_thor_got_the_hammer.txt:
> {'say', 'else', 'along', 'lying', 'such', 'ring', 'Sif', 'pocket',
> 'shining', 'pay', 'proud', 'than', "Brok's", 'mischief', 'dwarfs',
> "dwarfs'", 'miss', 'getting', 'misses', 'blood', 'stop', 'mark', 'Did',
> 'answer', 'same', "wife's", 'bellows', 'throw', 'dwarf', 'neck', 'Brok',
> 'Sindre', 'pig', 'beads', 'touch', 'touching', 'fold', 'pigskin',
> 'wonderful', 'hurried', 'Odin', 'spear', 'lump', 'crown', 'horse',
> 'showed', 'Each', 'forehead', 'crying', 'busy', 'blow', 'Pretty', 'backs',
> 'yet', 'working', 'crooked', 'nice', 'thumb', "Loki's", 'Their', 'burning',
> "Sif's", 'standing', 'brush', 'cutting', 'journeys', 'sorry', 'worked',
> 'brother', 'Blow', 'cannot', 'says', 'without', 'wait', 'Somebody',
> 'tricks', 'Got', 'blowing', 'spoiled', 'anywhere'}
>
> Unique to 07_the_hammer_lost_and_found.txt:
> {'while', "Giants'", 'taken', 'planned', 'laugh', 'everything', 'eight',
> "Freyja's", 'salmon', 'Get', 'brought', "bride's", 'drank', 'servant',
> 'Found', 'Giant', 'sing', 'lap', 'shook', 'lifted', 'Any', 'necklace',
> 'dogs', 'whole', "Giant's", 'Thrym', 'clothes', 'thirsty', 'eaten',
> 'barrels', 'dress', 'bite', 'comes', 'miles', 'kiss', 'Do', 'Put',
> "hasn't", 'makes', 'braided', 'Go', "Thrym's", 'Old', 'nights', 'Freyja',
> 'tore', 'play', 'floor', 'sit', "won't", 'collars', 'shone', 'others',
> 'deep', 'drink', 'dressed', 'shine', 'Lost', 'bride', 'vail', 'buried',
> 'Still', 'talked', 'mead', 'whirled', 'wagon'}
>
> Unique to 08_the_story_of_the_sheep.txt:
> {'bad', 'Long', 'sister', 'lose', 'catch', 'Hold', 'Story', 'Helle',
> 'ride', 'garden', 'sheep', 'played', 'boy', 'First', 'ago', 'nailed',
> 'Sheep', 'pat', 'clouds', 'loved', "sheep's", 'tame', 'dizzy', 'sky',
> 'Every', 'tight'}
>
> Unique to 09_the_good_ship_argo.txt:
> {'creek', 'Ship', 'wade', 'strings', 'rained', 'shoe', 'To', 'wild',
> 'bridge', 'party', 'invited'}
>
> Unique to 10_jason_and_the_harpies.txt:
> {'dove', 'friends', 'wings', 'apart', 'thanked', 'close', 'skin', 'drive',
> 'These', 'drowned', 'helping', 'bye', 'boat', 'past', 'scratched', 'hill',
> 'blind', 'Row', 'waterlike', 'moved', 'sailed', 'fishes', 'together',
> 'break', 'row', 'food', 'Harpies', 'On', 'icebergs'}
>
> Unique to 11_the_brass_bulls.txt:
> {'Bulls', 'knees', 'should', 'Rub', 'burn', 'princess', 'plant', 'pushed',
> 'planted', 'tied', 'face', 'slowly', 'seats', 'stronger', 'well', 'place',
> 'wheat', 'smoke', 'hold', 'chains', 'kicked', 'run', 'plow', 'Brass',
> "bulls'", 'marble', 'creeks', 'noses', 'snakes', 'mouths', 'sword', 'noon',
> 'plowed', 'plants', 'boys', 'stone', 'evening', 'stall', 'lie', 'heads',
> 'Early', 'larger', 'Nothing'}
>
> Unique to 12_jason_and_the_dragon.txt:
> {'eats', 'sleeps', 'became', 'father', 'mouth', 'yourself', 'died',
> 'nail', 'His', "Jason's", 'fond', 'ships', 'stick', 'cakes', 'nine',
> 'Dragon'}
>
>
> _______________________________________________
> nupic mailing list
> [email protected]
> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
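
For anyone who wants to experiment with the punctuation rules Francisco proposes in the quoted message, here is a minimal sketch in Python 3. It is not James's script. The names `tokenize` and `collection_stats` are made up for illustration, it assumes ASCII letters only, and it counts words case-sensitively because the unique-word lists above appear to do so. One judgment call to flag: a trailing apostrophe, as in dwarfs', is dropped, since only apostrophes between letters are kept.

```python
import re

# Kept as tokens: words (with internal apostrophes, e.g. haven't, mama's)
# and the four punctuation marks , . ? !
# Every other character, including all quote variants, acts like a blank.
# Assumption: ASCII letters only, which fits these English stories.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|[,.?!]")


def tokenize(text):
    """Split text into word tokens and the kept punctuation marks."""
    text = text.replace(";", ",")       # semicolon becomes comma
    text = text.replace("\u2019", "'")  # typographic apostrophe -> ASCII
    return TOKEN_RE.findall(text)


def collection_stats(texts):
    """texts maps a name to raw text.

    Returns (stats, unique): stats[name] is (token count, type count)
    over words only, and unique[name] is the set of words that occur
    in no other text of the collection.
    """
    words = {
        name: [tok for tok in tokenize(body) if tok[0].isalpha()]
        for name, body in texts.items()
    }
    stats = {name: (len(ws), len(set(ws))) for name, ws in words.items()}
    unique = {}
    for name, ws in words.items():
        others = set()
        for other_name, other_ws in words.items():
            if other_name != name:
                others.update(other_ws)
        unique[name] = set(ws) - others
    return stats, unique


if __name__ == "__main__":
    print(tokenize("'No; I haven't."))
    # ['No', ',', 'I', "haven't", '.']
    stats, unique = collection_stats(
        {"a": "The duck swims.", "b": "The hen sits."}
    )
    print(stats["a"], sorted(unique["a"]))
    # (3, 3) ['duck', 'swims']
```

With these rules the 'No token from 02_the_little_pine_tree.txt would come out as No, so the counts would differ slightly from the table above.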
