2010/1/16 Sean Owen <sro...@gmail.com>:
> 351MB isn't so bad.
>
> I do think the next-best idea to explore is a trie, which could use a
> char->Object map data structure provided by our new collections
> module? To the extent this data is more compact when encoded in UTF-8,
> it will be *much* more compact encoded in a trie.
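For concreteness, the char-keyed trie node Sean describes might look
roughly like the sketch below; the class and method names are made up,
and a plain java.util.HashMap stands in for whatever char->Object map the
collections module ends up providing:

import java.util.HashMap;
import java.util.Map;

class TrieNode {
  // char -> child node; a terminal node carries the count for the term
  // spelled out by the path from the root.
  private final Map<Character, TrieNode> children =
      new HashMap<Character, TrieNode>();
  private int count = 0;

  void add(CharSequence term) {
    TrieNode node = this;
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      TrieNode child = node.children.get(c);
      if (child == null) {
        child = new TrieNode();
        node.children.put(c, child);
      }
      node = child;
    }
    node.count++;
  }

  int count(CharSequence term) {
    TrieNode node = this;
    for (int i = 0; i < term.length(); i++) {
      node = node.children.get(term.charAt(i));
      if (node == null) {
        return 0;
      }
    }
    return node.count;
  }
}

Since many terms share prefixes, each prefix is stored only once, which is
where the memory saving over a flat dictionary of full strings would come
from.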
A more radical way to solve this dictionary memory issue would be to use a
hashed representation of the term counts:
http://hunch.net/~jl/projects/hash_reps/index.html

Or maybe a less radical, yet more complicated to implement, approach such
as Counting Filters (a variant of Bloom Filters):
http://en.wikipedia.org/wiki/Bloom_filter#Counting_filters

Maybe it would be best implemented by extracting the public API of
DictionaryVectorizer as an interface, TermVectorizer or just Vectorizer,
and providing alternative implementations such as HashingVectorizer and
CountingFiltersVectorizer (though I haven't checked yet whether they are
iso-functional, even setting aside the collision / false positive
probabilities). A rough sketch of what I mean follows below.

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
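To make the interface-extraction idea a bit more concrete, here is a very
rough sketch; all of the names (Vectorizer, HashingVectorizer,
numFeatures, ...) are hypothetical, nothing of this exists in Mahout yet:

interface Vectorizer {
  /** Accumulate the counts of the given tokens into a fixed-size vector. */
  double[] vectorize(Iterable<String> tokens);
}

class HashingVectorizer implements Vectorizer {
  private final int numFeatures;

  HashingVectorizer(int numFeatures) {
    this.numFeatures = numFeatures;
  }

  public double[] vectorize(Iterable<String> tokens) {
    double[] counts = new double[numFeatures];
    for (String token : tokens) {
      // No dictionary at all: the index of a term is just its hash modulo
      // the vector size, so distinct terms may collide into the same bucket.
      int index = Math.abs(token.hashCode() % numFeatures);
      counts[index]++;
    }
    return counts;
  }
}

A CountingFiltersVectorizer could plug in behind the same interface:
conceptually it would hash each term with several independent hash
functions, increment all of the matching counters, and report the minimum
of those counters as the (possibly over-estimated) count.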