I think in order to allow word-SDRs as input, we need to implement this pass-through encoder:
https://issues.numenta.org/browse/NPC-266

I've upped the priority of this ticket. Hopefully we can get it done in the
next sprint.

---------
Matt Taylor
OS Community Flag-Bearer
Numenta


On Thu, Sep 12, 2013 at 2:00 PM, Chetan Surpur <[email protected]> wrote:

> I intend to try it out as well, but I didn't see any data dump from the
> CEPT API yet (I know people were talking about it in the other thread). Let
> me know if you make any progress on this! If you're interested, we can even
> work together on it.
>
>
> On Thu, Sep 12, 2013 at 1:56 PM, Matthew Taylor <[email protected]> wrote:
>
>> Looks like Chetan is using the nupic-texts project as a data source for
>> his "linguist" project.
>>
>>
>> http://lists.numenta.org/pipermail/nupic_lists.numenta.org/2013-August/001040.html
>> https://github.com/chetan51/linguist
>>
>> It is using a category encoder on each letter, and predicts the next
>> letter within a sequence. In order to predict words in sequence instead of
>> letters, I'm going to try to see how easy it will be to get word SDRs out
>> of the CEPT API and input into nupic instead of letters. I'm not sure how
>> much time I'll be able to spend on this, but if I get anything worthwhile,
>> I'll put it up on github.
>>
>> Has anyone done something like this with the CEPT API yet?
>>
>> ---------
>> Matt Taylor
>> OS Community Flag-Bearer
>> Numenta
>>
>>
>> On Thu, Aug 29, 2013 at 2:01 PM, Francisco Webber <[email protected]> wrote:
>>
>>> James,
>>> This looks great!
>>> Yes, the apostrophe tricked the parser …
>>> We could simply edit this in the source file and recompute the stats. In
>>> terms of punctuation we should just keep comma, full stop, question mark,
>>> exclamation mark. Semicolon should be changed into comma.
>>> Even if they might not appear in these texts, it's always good to make the
>>> code fail-safe in this respect.
>>> Apostrophes and quotes are usually a mess.
>>> There must be something like
>>> 250 character codes in UTF-8 that produce some character that can behave
>>> like quotes…
>>> Best would be to replace them with blanks. Words that are in the reduced
>>> form like haven't should be taken as one word including the apostrophe.
>>> I will work out the next steps over the weekend and post my achievements
>>> on Monday.
>>> I also need to make some adaptations on the Retina for the CLA link-up.
>>> Looks like texts 2 and 9 drop out of line a bit… maybe we should use
>>> them just for doing unseen text tests, as they have few exclusive words. I
>>> will give it another thought….
>>>
>>> Thanks for your support.
>>>
>>> Francisco
>>>
>>>
>>> On 29.08.2013, at 05:05, James Tauber wrote:
>>>
>>> I pushed a Python 3 script to my repo that does a bunch of calculations.
>>>
>>> Here are the results of that script. Let me know what you'd like to see
>>> next. I can already see one problem in the tokenization where 'No was
>>> not split.
>>>
>>> FILENAME                            BYTES TOKEN  TYPE
>>> -----------------------------------------------------
>>> 01_the_ugly_duckling.txt             3143   782   207
>>> 02_the_little_pine_tree.txt          1635   388   104
>>> 03_the_little_match_girl.txt         3065   701   218
>>> 04_little_red_riding_hood.txt        2168   509   159
>>> 05_the_apples_of_idun.txt            3923   934   244
>>> 06_how_thor_got_the_hammer.txt       5857  1373   318
>>> 07_the_hammer_lost_and_found.txt     4260  1010   258
>>> 08_the_story_of_the_sheep.txt        1265   304   129
>>> 09_the_good_ship_argo.txt             889   209   107
>>> 10_jason_and_the_harpies.txt         2187   495   173
>>> 11_the_brass_bulls.txt               3487   786   239
>>> 12_jason_and_the_dragon.txt          1867   427   180
>>> -----------------------------------------------------
>>> COLLECTION                          33746  7918   882
>>>
>>> Unique to 01_the_ugly_duckling.txt:
>>> {'spring', 'hid', 'summer', 'dears', 'lake', 'swans', 'own', 'eggs',
>>> 'lay', 'still', 'eating', 'pond', 'duckling', 'yard', 'Soon', 'egg', 'bug',
>>> 'cat', 'bushes', 'does', 'those', 'fun', 'winter', 'duck', 'Ugly',
>>> 'lovely',
>>> 'woman', 'hens', 'swim', 'While', 'swan', 'sang', 'nest',
>>> 'corner', 'bread', 'Splash', 'because', 'mother', 'growl', 'ducks', 'An',
>>> 'Let', 'noise', 'hen', 'ducklings', 'Only', 'Stay', 'Duckling'}
>>>
>>> Unique to 02_the_little_pine_tree.txt:
>>> {'Tree', 'broken', 'bag', 'Pine', 'needles', "'No", 'green', 'Night',
>>> 'pine', 'nor', 'glass', 'Again'}
>>>
>>> Unique to 03_the_little_match_girl.txt:
>>> {'dead', 'money', 'another', 'bunch', 'star', 'death', 'step',
>>> 'matches', 'O', 'papa', 'candle', 'Very', 'goes', "mama's", 'name',
>>> 'Match', 'cooking', 'smelled', 'falls', 'more', 'stars', 'frozen', 'stove',
>>> 'slippers', 'even', 'whip', 'froze', 'dying', 'running', 'curly', 'sweet',
>>> 'match', 'houses', 'knife', 'rags', 'sell', 'herself', 'pile', 'snow',
>>> 'lights', 'dish', 'buy', 'dishes', 'roast', 'Girl', 'apron', 'fork', 'Her',
>>> 'street', 'bare', 'God', 'cloth', 'windows', 'year', 'lot', 'heaven',
>>> 'Gretchen', 'room', 'colder', 'candles', 'Christmas'}
>>>
>>> Unique to 04_little_red_riding_hood.txt:
>>> {'live', 'tapped', 'string', 'Pull', 'dear', 'pick', 'cap', 'hunter',
>>> 'mill', 'Does', 'hug', 'open', 'voice', 'stopped', 'wood', "grandma's",
>>> 'Red', 'Thank', 'Look', 'butter', 'Hood', 'Mama', 'lady', 'soft', 'red',
>>> 'six', 'May', 'scream', 'ears', 'basket', 'Riding', 'hood', 'mama', 'wolf'}
>>>
>>> Unique to 05_the_apples_of_idun.txt:
>>> {'minute', 'walls', 'beautiful', "eagle's", 'pale', 'stuck', 'breath',
>>> 'Apples', 'stayed', 'pole', 'field', 'against', 'Idun', 'bumped', 'nut',
>>> 'share', 'talking', 'Day', 'feathers', 'supper', 'changed', 'story',
>>> 'apples', 'box', 'Those', 'ribs', 'cross', 'fast', 'eagle', 'blazed',
>>> 'Please', 'gate', 'Once', 'gates', 'end', 'liked', 'cook', 'enough',
>>> 'please', 'putting', 'meat', 'cattle', 'upon', 'journey', 'Bring', 'four'}
>>>
>>> Unique to 06_how_thor_got_the_hammer.txt:
>>> {'say', 'else', 'along', 'lying', 'such', 'ring', 'Sif', 'pocket',
>>> 'shining', 'pay', 'proud',
>>> 'than', "Brok's", 'mischief', 'dwarfs',
>>> "dwarfs'", 'miss', 'getting', 'misses', 'blood', 'stop', 'mark', 'Did',
>>> 'answer', 'same', "wife's", 'bellows', 'throw', 'dwarf', 'neck', 'Brok',
>>> 'Sindre', 'pig', 'beads', 'touch', 'touching', 'fold', 'pigskin',
>>> 'wonderful', 'hurried', 'Odin', 'spear', 'lump', 'crown', 'horse',
>>> 'showed', 'Each', 'forehead', 'crying', 'busy', 'blow', 'Pretty', 'backs',
>>> 'yet', 'working', 'crooked', 'nice', 'thumb', "Loki's", 'Their', 'burning',
>>> "Sif's", 'standing', 'brush', 'cutting', 'journeys', 'sorry', 'worked',
>>> 'brother', 'Blow', 'cannot', 'says', 'without', 'wait', 'Somebody',
>>> 'tricks', 'Got', 'blowing', 'spoiled', 'anywhere'}
>>>
>>> Unique to 07_the_hammer_lost_and_found.txt:
>>> {'while', "Giants'", 'taken', 'planned', 'laugh', 'everything', 'eight',
>>> "Freyja's", 'salmon', 'Get', 'brought', "bride's", 'drank', 'servant',
>>> 'Found', 'Giant', 'sing', 'lap', 'shook', 'lifted', 'Any', 'necklace',
>>> 'dogs', 'whole', "Giant's", 'Thrym', 'clothes', 'thirsty', 'eaten',
>>> 'barrels', 'dress', 'bite', 'comes', 'miles', 'kiss', 'Do', 'Put',
>>> "hasn't", 'makes', 'braided', 'Go', "Thrym's", 'Old', 'nights', 'Freyja',
>>> 'tore', 'play', 'floor', 'sit', "won't", 'collars', 'shone', 'others',
>>> 'deep', 'drink', 'dressed', 'shine', 'Lost', 'bride', 'vail', 'buried',
>>> 'Still', 'talked', 'mead', 'whirled', 'wagon'}
>>>
>>> Unique to 08_the_story_of_the_sheep.txt:
>>> {'bad', 'Long', 'sister', 'lose', 'catch', 'Hold', 'Story', 'Helle',
>>> 'ride', 'garden', 'sheep', 'played', 'boy', 'First', 'ago', 'nailed',
>>> 'Sheep', 'pat', 'clouds', 'loved', "sheep's", 'tame', 'dizzy', 'sky',
>>> 'Every', 'tight'}
>>>
>>> Unique to 09_the_good_ship_argo.txt:
>>> {'creek', 'Ship', 'wade', 'strings', 'rained', 'shoe', 'To', 'wild',
>>> 'bridge', 'party', 'invited'}
>>>
>>> Unique to 10_jason_and_the_harpies.txt:
>>> {'dove', 'friends', 'wings', 'apart', 'thanked', 'close', 'skin',
>>> 'drive', 'These', 'drowned',
>>> 'helping', 'bye', 'boat', 'past', 'scratched',
>>> 'hill', 'blind', 'Row', 'waterlike', 'moved', 'sailed', 'fishes',
>>> 'together', 'break', 'row', 'food', 'Harpies', 'On', 'icebergs'}
>>>
>>> Unique to 11_the_brass_bulls.txt:
>>> {'Bulls', 'knees', 'should', 'Rub', 'burn', 'princess', 'plant',
>>> 'pushed', 'planted', 'tied', 'face', 'slowly', 'seats', 'stronger', 'well',
>>> 'place', 'wheat', 'smoke', 'hold', 'chains', 'kicked', 'run', 'plow',
>>> 'Brass', "bulls'", 'marble', 'creeks', 'noses', 'snakes', 'mouths',
>>> 'sword', 'noon', 'plowed', 'plants', 'boys', 'stone', 'evening', 'stall',
>>> 'lie', 'heads', 'Early', 'larger', 'Nothing'}
>>>
>>> Unique to 12_jason_and_the_dragon.txt:
>>> {'eats', 'sleeps', 'became', 'father', 'mouth', 'yourself', 'died',
>>> 'nail', 'His', "Jason's", 'fond', 'ships', 'stick', 'cakes', 'nine',
>>> 'Dragon'}
>>>
>>>
>>> _______________________________________________
>>> nupic mailing list
>>> [email protected]
>>> http://lists.numenta.org/mailman/listinfo/nupic_lists.numenta.org
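
For reference, the punctuation rules Francisco proposes in the quoted thread (keep comma, full stop, question mark and exclamation mark; turn semicolons into commas; treat quote characters as blanks; keep contractions like haven't as one word) could be sketched roughly as below. This is not James's script, only an illustration under a few assumptions: the function names are made up, only the straight and curly apostrophe are recognised inside words, token counts exclude punctuation, and types are counted case-sensitively (as the 'duckling' / 'Duckling' entries in the results suggest). A trailing possessive apostrophe such as dwarfs' is dropped by this sketch.

```python
import re

# Punctuation marks that survive as their own tokens.
KEPT_PUNCTUATION = set(",.?!")

# A word is a run of word characters, optionally followed by
# apostrophe + more word characters (haven't, Loki's).
# Anything else (quotes, leading apostrophes, dashes) is skipped,
# which has the same effect as replacing it with a blank.
TOKEN_RE = re.compile(r"\w+(?:['\u2019]\w+)*|[,.?!]")


def tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    text = text.replace(";", ",")  # semicolon is changed into comma
    tokens = TOKEN_RE.findall(text)
    # Normalise the curly apostrophe so haven’t and haven't are one type.
    return [t.replace("\u2019", "'") for t in tokens]


def token_and_type_counts(text):
    """Return (number of word tokens, number of distinct word types)."""
    words = [t for t in tokenize(text) if t not in KEPT_PUNCTUATION]
    return len(words), len(set(words))


if __name__ == "__main__":
    sample = "'No, I haven't; said Loki's wife! No."
    print(tokenize(sample))
    print(token_and_type_counts(sample))
```

With these rules the leading apostrophe in 'No no longer sticks to the word, which is the tokenization problem James mentions.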
