We have a system that needs to incrementally calculate document-document
text similarity metrics as new documents are seen.  I'm trying to
understand whether feature hashing with, for example, a
StaticWordValueEncoder is appropriate for this kind of use case.  Our text
documents can contain web content, so the size of the feature vector is
effectively bounded only by the vocabulary of the language.

Our current implementation uses a simple vector-based bag-of-words model to
create 'one cell per word' feature vectors for each document, and we then
use cosine similarity to determine document-to-document similarity.  We are
not using Mahout.

The issue with this approach is that the 'one cell per word' feature
vectors require a singleton dictionary object to turn words into vector
indices, so we can only index one document at a time.
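To make the bottleneck concrete, here is a stripped-down sketch of the kind
of thing we do today (the class and method names are just for illustration,
not our real code):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Simplified sketch of a dictionary-backed bag-of-words vectorizer.
    // The shared dictionary is the problem: every document must go through
    // it to get a stable word -> index assignment, so indexing is serialized.
    public class DictionaryVectorizer {

        // Singleton word -> index dictionary shared by all documents.
        private final Map<String, Integer> dictionary = new HashMap<>();

        // Must be synchronized (or otherwise single-threaded) because new
        // words grow the dictionary and shift the vector dimensionality.
        public synchronized Map<Integer, Double> encode(List<String> tokens) {
            Map<Integer, Double> vector = new HashMap<>();
            for (String token : tokens) {
                int index = dictionary.computeIfAbsent(token, t -> dictionary.size());
                vector.merge(index, 1.0, Double::sum);
            }
            return vector;
        }

        // Cosine similarity over the sparse 'one cell per word' vectors.
        public static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
            double dot = 0, normA = 0, normB = 0;
            for (Map.Entry<Integer, Double> e : a.entrySet()) {
                dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
                normA += e.getValue() * e.getValue();
            }
            for (double v : b.values()) {
                normB += v * v;
            }
            return dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }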

I've been reading through the Mahout archives and the Mahout in Action book
looking to see if Mahout has any answers to help with incremental,
parallelized vector generation, but it seems like Mahout's seq2sparse
process has the same 'batch' issue.  I've seen various posts referring to
feature hashing as a way around this, and the classifiers in part 3 of
Mahout in Action explain how to use feature hashing to encode text-like
features.
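From the Mahout in Action examples, my understanding is that the hashed
encoding would look roughly like this, with the vector cardinality fixed up
front and no shared dictionary needed (the cardinality and probe count here
are just placeholder values):

    import java.util.List;

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class HashedVectorizer {

        // Fixed vector cardinality chosen up front; placeholder value.
        private static final int CARDINALITY = 100_000;

        // Each document can be encoded independently: the word -> index
        // mapping is a hash function, so there is no shared dictionary
        // and documents can be vectorized in parallel.
        public Vector encode(List<String> tokens) {
            FeatureVectorEncoder encoder = new StaticWordValueEncoder("body");
            encoder.setProbes(2);  // multiple probes, as recommended for text
            Vector vector = new RandomAccessSparseVector(CARDINALITY);
            for (String token : tokens) {
                encoder.addToVector(token, vector);
            }
            return vector;
        }
    }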

I'm just too green to know whether it's appropriate for our use case,
particularly whether the multiple probes recommended when hashing text
features, and the likelihood of feature collisions, will significantly
compromise our cosine similarity calculations.
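Unless someone can answer that from experience, my plan was to measure it
empirically: encode a sample of document pairs both ways and compare the
resulting similarities.  Something along these lines, reusing the
HashedVectorizer sketch above (the token lists are just toy data):

    import java.util.Arrays;
    import java.util.List;

    import org.apache.mahout.math.Vector;

    public class CollisionCheck {

        // Cosine similarity via Mahout's Vector dot product and 2-norm.
        static double cosine(Vector a, Vector b) {
            return a.dot(b) / (a.norm(2) * b.norm(2));
        }

        public static void main(String[] args) {
            HashedVectorizer hashed = new HashedVectorizer();  // sketch above
            List<String> doc1 = Arrays.asList("the", "quick", "brown", "fox");
            List<String> doc2 = Arrays.asList("the", "lazy", "brown", "dog");

            // Compare this against the similarity our dictionary-based code
            // reports for the same pair, to see how much hashing (and the
            // extra probes) shift the numbers.
            System.out.println(cosine(hashed.encode(doc1), hashed.encode(doc2)));
        }
    }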

Thanks for any insights
Martin
