MoreLikeThis looks exactly like what I need. I would probably add a new
"like" method that takes a Mahout vector and builds a search from it? I build
the vector by starting from a doc and reweighting certain terms. The prototype
just reweights words, but I may experiment with Dirichlet clusters and
reweighting an entire cluster of words, so you could boost the importance of a
topic in the results. Either way, the result of either algorithm would be a
Mahout vector.
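
A minimal sketch of the reweighting step, assuming the vector is held as a
plain term-to-weight dict rather than a real Mahout vector. The function names
"reweight" and "boost_topic" and the sample terms are hypothetical, chosen
only for illustration.

```python
def reweight(doc_vector, boosts, default=1.0):
    """Return a copy of doc_vector with each weight multiplied by its boost.

    Terms without an entry in `boosts` are multiplied by `default`.
    """
    return {term: w * boosts.get(term, default) for term, w in doc_vector.items()}


def boost_topic(doc_vector, topic_terms, factor):
    """Boost every term that belongs to a (hypothetical) cluster of words."""
    return reweight(doc_vector, {term: factor for term in topic_terms})


# Start from a document's term weights and double the weight of one topic.
doc = {"lucene": 0.5, "solr": 0.25, "hadoop": 1.0}
query_vec = boost_topic(doc, {"lucene", "solr"}, 2.0)
```

Either path (single words or a whole cluster) ends in the same kind of
weighted vector, which is what the query would be built from.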

Is there a description of how this works somewhere? Is it basically an index 
lookup? I always thought the Google feature used precalculated results (and it 
probably does). I'm curious, but mainly asking to see how fast it is.

Thanks
Pat

On 3/11/12 8:36 AM, Paul Libbrecht wrote:
Maybe that's exactly it, but... given a document with n tokens A and m tokens 
B, a query A^n B^m would find what you're looking for, wouldn't it?
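
As a rough illustration of the A^n B^m idea, the sketch below turns a
term-to-weight map into a boosted query string using the term^boost form of
the Lucene query syntax. The field name "text", the function name and the
top_n cutoff are assumptions for illustration, and terms are assumed to be
plain tokens that need no query-syntax escaping.

```python
def to_boosted_query(weights, field="text", top_n=25):
    """Build a 'field:term^boost' query from the highest-weighted terms."""
    # Sort by descending weight, then by term, so the output is deterministic.
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
    # Skip zero or negative weights; this sketch treats them as "not in the query".
    return " ".join(f"{field}:{term}^{weight:g}" for term, weight in top if weight > 0)


# A document with 3 occurrences of token "a" and 2 of token "b".
query = to_boosted_query({"a": 3, "b": 2})
```

Limiting the query to the top few terms keeps the number of clauses small,
which is the same kind of trade-off MoreLikeThis makes when it selects
"interesting" terms.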

paul

PS: I've always viewed queries as linear forms on the vector space and I'd like 
to see this really mathematically written one day...

On 11 March 2012 at 07:23, Lance Norskog wrote:

Look at the MoreLikeThis feature in Lucene. I believe it does roughly
what you describe.

On Sat, Mar 10, 2012 at 9:58 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
I have a case where I'd like to get the documents which most closely match a
particular vector. Mahout's RowSimilarityJob is ideal for precalculating
similarity between existing documents, but in my case the query is
constructed at run time: the UI constructs a vector to be used as a query.
We have this running in prototype using a run time calculation of cosine
similarity, but the implementation does not scale to large doc stores.
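
For reference, a brute-force run time cosine ranking looks roughly like the
sketch below. It is not the prototype's code; it assumes sparse vectors held
as term-to-weight dicts, and the names "cosine" and "top_k" are hypothetical.
It shows why the approach stops scaling: every query touches every document.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an empty vector as similar to nothing
    return dot / (norm_u * norm_v)


def top_k(query, docs, k=10):
    """Score the query against every document and keep the k best matches."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in docs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


docs = {"d1": {"a": 1.0}, "d2": {"b": 1.0}, "d3": {"a": 1.0, "b": 1.0}}
best = top_k({"a": 2.0}, docs, k=2)
```

The cost per query is linear in the number of documents, which is what
clustering or hashing would be trying to cut down.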

One thought is to calculate fairly small clusters. The UI will know which
cluster to target for the vector query. So we might be able to narrow down
the number of docs per query to a reasonable size.

It seems like a place for multiple hash functions maybe? Could we use some
kind of hack of the boost feature of Solr or some other approach?
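
One way to read "multiple hash functions" is locality-sensitive hashing with
random hyperplanes, a common technique for cosine similarity. The sketch below
is that swapped-in technique, not anything Mahout or Solr provides out of the
box; it assumes dense vectors as lists of floats, and all names and parameters
(make_hyperplanes, n_bits, seed) are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=42):
    """Draw n_bits random hyperplanes; each contributes one bit of the hash."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """Hash a vector to an integer: one bit per hyperplane, set by the dot sign."""
    bits = 0
    for plane in planes:
        dot = sum(p * x for p, x in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def build_index(docs, planes):
    """Group document ids into buckets keyed by their signature."""
    buckets = {}
    for doc_id, vec in docs.items():
        buckets.setdefault(signature(vec, planes), []).append(doc_id)
    return buckets


def candidates(query, buckets, planes):
    """Return only the documents that share the query's bucket."""
    return buckets.get(signature(query, planes), [])


planes = make_hyperplanes(dim=3, n_bits=8)
index = build_index({"d1": [1.0, 2.0, 3.0], "d2": [-1.0, -2.0, -3.0]}, planes)
hits = candidates([2.0, 4.0, 6.0], index, planes)
```

Vectors at a small angle to each other tend to fall in the same bucket, so
exact cosine similarity only has to be computed on the candidates. Using
several independent sets of hyperplanes (several tables) would raise recall at
the cost of more lookups.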

Does anyone have a suggestion?


--
Lance Norskog
goks...@gmail.com
