2010/1/16 Drew Farris <drew.far...@gmail.com>:
> I agree the overhead of byte[] -> UTF-8 probably isn't too good for
> lookup performance.
>
> In line with Sean's suggestion, I've used tries in the past for doing
> this sort of string -> integer mapping. They generally perform well
> enough for adding entries as well as retrieval. Not nearly as
> efficient as a hash, but there is usually enough of a memory savings
> to make it worth it. They have the added benefit of making it easy to
> do prefix searches, although that isn't a strict requirement here.
>
> As Oliver suggests, a bloom filter may be an option, but wouldn't a
> secondary data structure be required to hold the actual values? Would
> false positives really be an issue with a dictionary scale problem?
>
> I presume there's a need for a compact integer -> string representation,
> which can be achieved using string difference compression: seek to the
> nearest block boundary below the id, then build up the final string by
> scanning forward through the list of incremental changes. IIRC, Lucene
> does something like this.
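
For reference, a minimal sketch of the trie-based string -> id mapping
described above might look like the following (untested; class and
method names are made up, and the naive HashMap-per-node layout would
need a more compact node encoding to actually realize the memory
savings mentioned):

import java.util.HashMap;
import java.util.Map;

/** Naive trie assigning a stable integer id to each distinct term. */
public class TermTrie {

  private static final int NO_ID = -1;

  private static final class Node {
    final Map<Character, Node> children = new HashMap<Character, Node>();
    int id = NO_ID;
  }

  private final Node root = new Node();
  private int nextId = 0;

  /** Returns the id of the term, assigning a fresh one on first sight. */
  public int getOrAdd(CharSequence term) {
    Node node = root;
    for (int i = 0; i < term.length(); i++) {
      Character c = term.charAt(i);
      Node child = node.children.get(c);
      if (child == null) {
        child = new Node();
        node.children.put(c, child);
      }
      node = child;
    }
    if (node.id == NO_ID) {
      node.id = nextId++;
    }
    return node.id;
  }

  /** Returns the id of the term, or -1 if it has never been added. */
  public int lookup(CharSequence term) {
    Node node = root;
    for (int i = 0; i < term.length(); i++) {
      node = node.children.get(term.charAt(i));
      if (node == null) {
        return NO_ID;
      }
    }
    return node.id;
  }
}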
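
And a rough sketch of the incremental (front-coded) id -> string side,
with the block size and class name again invented for illustration:

import java.util.ArrayList;
import java.util.List;

/**
 * Front-coded term list: every BLOCK-th term is stored in full, the
 * others only as the suffix that differs from the previous term plus
 * the length of the shared prefix. get(id) seeks to the enclosing
 * block boundary and replays the deltas forward.
 */
public class FrontCodedTerms {

  private static final int BLOCK = 16;

  private final List<String> suffixes = new ArrayList<String>();
  private final List<Integer> prefixLengths = new ArrayList<Integer>();
  private String previous = "";

  /** Terms must be added in sorted order so neighbours share prefixes. */
  public int add(String term) {
    int id = suffixes.size();
    int prefix = 0;
    if (id % BLOCK != 0) {
      int max = Math.min(term.length(), previous.length());
      while (prefix < max && term.charAt(prefix) == previous.charAt(prefix)) {
        prefix++;
      }
    }
    prefixLengths.add(prefix);
    suffixes.add(term.substring(prefix));
    previous = term;
    return id;
  }

  /** Rebuilds the full term by scanning forward from the block start. */
  public String get(int id) {
    int blockStart = id - (id % BLOCK);
    String term = suffixes.get(blockStart); // block-start terms are stored in full
    for (int i = blockStart + 1; i <= id; i++) {
      term = term.substring(0, prefixLengths.get(i)) + suffixes.get(i);
    }
    return term;
  }
}

Sorted insertion keeps neighbouring terms sharing long prefixes, which
is where the compression comes from.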

AFAIK we only use a dictionary to map term values (string
representations) to term indexes (or more generally feature indexes).
But the string value is no longer needed when training and testing the
models: only Vectors of feature values (term counts, frequencies,
TF-IDF) are needed to classify / cluster a document or train a model.
Hence the use of a hashed representation, where the dictionary from
term representations to feature indexes is only implicitly represented
by a hash function, up to some adjustable hash collision rate. In
practice the collisions do not hurt convergence of models such as
linear SVMs a.k.a. large-margin perceptrons (or regularized logistic
regression, and probably naive Bayes classifiers too) thanks to the
redundant nature of dataset features in NLP tasks (see the papers
cited by John Langford on the webpage mentioned previously).
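
For illustration, the hashed representation boils down to something
like the sketch below (not Mahout's actual API; class, method and
parameter names are invented, and String.hashCode() merely stands in
for whatever hash function is actually used):

import java.util.HashMap;
import java.util.Map;

/**
 * Dictionary-free vectorizer: term -> feature index via a hash function
 * modulo a fixed vector size; colliding terms simply share an index.
 */
public class HashedVectorizer {

  private final int numFeatures;

  public HashedVectorizer(int numFeatures) {
    this.numFeatures = numFeatures;
  }

  /** Builds a sparse term-count vector for one tokenized document. */
  public Map<Integer, Double> vectorize(Iterable<String> tokens) {
    Map<Integer, Double> vector = new HashMap<Integer, Double>();
    for (String token : tokens) {
      // Any well-mixed hash will do; mask the sign bit before the modulo.
      int index = (token.hashCode() & Integer.MAX_VALUE) % numFeatures;
      Double current = vector.get(index);
      vector.put(index, current == null ? 1.0 : current + 1.0);
    }
    return vector;
  }
}

Growing numFeatures trades memory for a lower collision rate, which is
the adjustable knob mentioned above.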

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
