We have a system that needs to incrementally calculate document-to-document text similarity metrics as new documents arrive. I'm trying to understand whether feature hashing with, for example, a StaticWordValueEncoder is appropriate for this kind of use case. Our text documents can contain web content, so the size of the feature vector is really only bounded by the number of words in the language.
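As I understand the hashing trick, a word's vector index comes from a hash of the word itself rather than a dictionary lookup, so no shared state is needed between documents. Roughly like this Python sketch (NUM_FEATURES, the md5-based hash, and the per-probe weighting are my own illustrative choices, not the actual internals of Mahout's StaticWordValueEncoder):

```python
import hashlib

NUM_FEATURES = 1024  # fixed vector width, independent of vocabulary size

def hash_token(token, seed=0):
    # Deterministically map a token (plus a probe seed) to [0, NUM_FEATURES)
    digest = hashlib.md5(f"{seed}:{token}".encode()).hexdigest()
    return int(digest, 16) % NUM_FEATURES

def encode(text, probes=2):
    # Each token bumps `probes` hashed positions; splitting the weight
    # across probes keeps the total contribution per token at 1.0.
    vec = [0.0] * NUM_FEATURES
    for token in text.lower().split():
        for seed in range(probes):
            vec[hash_token(token, seed)] += 1.0 / probes
    return vec
```

Because the mapping is a pure function of the token, two processes can vectorize different documents at the same time and still agree on indexes, which is exactly the property our dictionary approach lacks.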
Currently our implementation uses a simple vector-based bag-of-words model to create 'one cell per word' feature vectors for each document, and then we use cosine similarity to determine document-to-document similarity. We are not using Mahout. The issue with this approach is that the one-cell-per-word feature vectors require a singleton dictionary object to turn words into vector indexes, so we can only index one document at a time.

I've been reading through the Mahout archives and the Mahout in Action book to see if Mahout has anything to help with incremental, parallelized vector generation, but it seems like the Mahout seq2sparse processes have the same 'batch' issue. I've seen various posts referring to feature hashing as a way around this, and the classifiers in part 3 of Mahout in Action explain how to use feature hashing to encode text-like features. I'm just too green to know whether it's appropriate for our use case, particularly whether the multiple probes recommended when feature hashing text, and the likelihood of feature collisions, will significantly compromise our cosine similarity calculations.

Thanks for any insights,
Martin
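P.S. To make the collision question concrete, here is a rough Python sketch (not our actual code; the md5 hash, two probes, and 4096-wide vectors are just illustrative choices) of hashed vectors feeding the same cosine calculation we use today:

```python
import hashlib
import math

NUM_FEATURES = 4096  # fixed width; collisions get rarer as this grows

def hashed_vector(text, probes=2):
    # Hashing-trick bag of words: no shared dictionary, so documents
    # can be vectorized independently and in parallel.
    vec = [0.0] * NUM_FEATURES
    for token in text.lower().split():
        for seed in range(probes):
            digest = hashlib.md5(f"{seed}:{token}".encode()).hexdigest()
            vec[int(digest, 16) % NUM_FEATURES] += 1.0 / probes
    return vec

def cosine(a, b):
    # Standard cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

doc_a = "incremental text similarity over web content"
doc_b = "incremental similarity for streaming web documents"
sim = cosine(hashed_vector(doc_a), hashed_vector(doc_b))
```

My worry is what happens to `sim` at realistic vocabulary sizes: unrelated words that hash to the same cells add spurious overlap to the dot product, and I don't have a feel for whether multiple probes tame that enough for similarity work, as opposed to classification where the learner can absorb the noise.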