2010/1/16 Sean Owen <sro...@gmail.com>:
> 351MB isn't so bad.
>
> I do think the next-best idea to explore is a trie, which could use a
> char->Object map data structure provided by our new collections
> module? To the extent this data is more compact when encoded in UTF-8,
> it will be *much* more compact encoded in a trie.
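For concreteness, the char-keyed trie node Sean describes might look
roughly like the sketch below; the class and method names are made up,
and a plain java.util.HashMap stands in for whatever char->Object map the
collections module ends up providing:

import java.util.HashMap;
import java.util.Map;

class TrieNode {
  // char -> child node; a terminal node carries the count for the term
  // spelled out by the path from the root.
  private final Map<Character, TrieNode> children =
      new HashMap<Character, TrieNode>();
  private int count = 0;

  void add(CharSequence term) {
    TrieNode node = this;
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      TrieNode child = node.children.get(c);
      if (child == null) {
        child = new TrieNode();
        node.children.put(c, child);
      }
      node = child;
    }
    node.count++;
  }

  int count(CharSequence term) {
    TrieNode node = this;
    for (int i = 0; i < term.length(); i++) {
      node = node.children.get(term.charAt(i));
      if (node == null) {
        return 0;
      }
    }
    return node.count;
  }
}

Since many terms share prefixes, each prefix is stored only once, which is
where the memory saving over a flat dictionary of full strings would come
from.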
A more radical way to solve this dictionary memory issue would be to use a
hashed representation of the term counts:
http://hunch.net/~jl/projects/hash_reps/index.html

Or maybe a less radical, yet more complicated to implement, approach such
as Counting Filters (a variant of Bloom Filters):
http://en.wikipedia.org/wiki/Bloom_filter#Counting_filters

Maybe it would be best implemented by extracting the public API of
DictionaryVectorizer as an interface, TermVectorizer or just Vectorizer,
and providing alternative implementations such as HashingVectorizer and
CountingFiltersVectorizer (though I haven't checked yet whether they are
iso-functional, even setting aside the collision / false positive
probabilities). A rough sketch of what I mean follows below.

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
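To make the interface-extraction idea a bit more concrete, here is a very
rough sketch; all of the names (Vectorizer, HashingVectorizer,
numFeatures, ...) are hypothetical, nothing of this exists in Mahout yet:

interface Vectorizer {
  /** Accumulate the counts of the given tokens into a fixed-size vector. */
  double[] vectorize(Iterable<String> tokens);
}

class HashingVectorizer implements Vectorizer {
  private final int numFeatures;

  HashingVectorizer(int numFeatures) {
    this.numFeatures = numFeatures;
  }

  public double[] vectorize(Iterable<String> tokens) {
    double[] counts = new double[numFeatures];
    for (String token : tokens) {
      // No dictionary at all: the index of a term is just its hash modulo
      // the vector size, so distinct terms may collide into the same bucket.
      int index = Math.abs(token.hashCode() % numFeatures);
      counts[index]++;
    }
    return counts;
  }
}

A CountingFiltersVectorizer could plug in behind the same interface:
conceptually it would hash each term with several independent hash
functions, increment all of the matching counters, and report the minimum
of those counters as the (possibly over-estimated) count.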