I have a case where I'd like to retrieve the documents that most closely match a particular vector. Mahout's RowSimilarityJob is ideal for precalculating similarity between existing documents, but in my case the query is constructed at run time: the UI builds the vector that is used as the query. We have a prototype running that computes cosine similarity at run time, but the implementation does not scale to large document stores.

One thought is to precalculate fairly small clusters. The UI will know which cluster to target for the vector query, so we might be able to narrow the number of docs scored per query down to a reasonable size.
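
To make the cluster idea concrete, here is a rough sketch in Python (not our actual code). It assumes the clusters have already been computed offline and that the UI passes in the cluster id. The names `search_cluster`, `clusters`, and `docs` are made up for illustration; only the docs in the chosen cluster are scored with cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as similar to nothing instead of dividing by zero.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search_cluster(query, cluster_id, clusters, docs, top_k=3):
    """Score only the docs in one precomputed cluster (hypothetical layout).

    clusters: dict mapping cluster id -> list of doc ids
    docs:     dict mapping doc id -> vector
    """
    scored = [(cosine(query, docs[doc_id]), doc_id)
              for doc_id in clusters[cluster_id]]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]

# Toy data: two clusters of two docs each.
docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
clusters = {0: ["a", "b"], 1: ["c", "d"]}
hits = search_cluster([1.0, 0.05], 0, clusters, docs, top_k=1)
```

The cost per query then depends on the cluster size rather than the size of the whole doc store, at the price of missing close docs that landed in a neighboring cluster.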

This also seems like a possible fit for multiple hash functions. Alternatively, could we hack Solr's boost feature, or is there some other approach?
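
By "multiple hash functions" I am thinking of something like locality-sensitive hashing with random hyperplanes, where each hash bit is the sign of a dot product with a random vector, so vectors with high cosine similarity tend to share buckets. The sketch below is only my assumption of how that could look in Python; `LshIndex`, `bits`, and `tables` are hypothetical names and nothing here is a Mahout or Solr API.

```python
import random
from collections import defaultdict

def make_planes(dim, bits, rng):
    """Draw `bits` random hyperplanes (Gaussian normal vectors) in `dim` dimensions."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )

class LshIndex:
    """Several independent hash tables; a doc is a candidate if it shares
    a bucket with the query in at least one table."""

    def __init__(self, dim, bits=8, tables=4, seed=42):
        rng = random.Random(seed)
        self.tables = [(make_planes(dim, bits, rng), defaultdict(set))
                       for _ in range(tables)]

    def add(self, doc_id, vec):
        for planes, buckets in self.tables:
            buckets[signature(vec, planes)].add(doc_id)

    def candidates(self, query):
        found = set()
        for planes, buckets in self.tables:
            # .get avoids creating empty buckets for unseen signatures.
            found |= buckets.get(signature(query, planes), set())
        return found

index = LshIndex(dim=3)
index.add("a", [1.0, 0.2, 0.0])
index.add("b", [0.0, 0.1, 1.0])
candidate_ids = index.candidates([1.0, 0.2, 0.0])
```

The candidates would still need an exact cosine pass to rank them; the hashing only shrinks the set that has to be scored. More bits per table make buckets smaller, and more tables reduce the chance of missing a close doc.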

Does anyone have a suggestion?
