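Look at the MoreLikeThis feature in Lucene (also available through Solr).
I believe it does roughly what you describe: it builds a query from the
most significant terms of a document and returns the best-matching
documents.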

On Sat, Mar 10, 2012 at 9:58 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> I have a case where I'd like to get documents which most closely match a
> particular vector. The RowSimilarityJob of Mahout is ideal for
> precalculating similarity between existing documents but in my case the
> query is constructed at run time: the UI builds a vector to be used as the
> query. We have this running in a prototype that computes cosine similarity
> at run time, but the implementation does not scale to large document
> stores.
>
> One thought is to calculate fairly small clusters. The UI will know which
> cluster to target for the vector query. So we might be able to narrow down
> the number of docs per query to a reasonable size.
>
> This seems like a place for multiple hash functions, maybe? Or could we
> hack the boost feature of Solr, or take some other approach?
>
> Does anyone have a suggestion?
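
On the "multiple hash functions" idea quoted above, here is a minimal sketch
of one way it could look, assuming random-hyperplane hashing (a common
locality-sensitive hash for cosine similarity). It is not how Mahout, Lucene
or Solr do it. The class name HyperplaneIndex, its parameters and the dense
list-of-floats vectors are all made up for illustration. Each table hashes a
vector to the sign pattern of its dot products with a few random hyperplanes;
a query only scores the documents that share a bucket with it in at least one
table, so exact cosine similarity is computed on a small candidate set
instead of the whole store.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class HyperplaneIndex:
    """Hypothetical index: several hash tables, each keyed by the sign
    pattern of a vector against that table's random hyperplanes."""

    def __init__(self, dim, num_tables=8, bits_per_table=6, seed=42):
        rng = random.Random(seed)
        # planes[t][b] is the normal vector of hyperplane b in table t.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(bits_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.docs = {}

    def _key(self, table_no, vec):
        # One boolean per hyperplane: which side of the plane vec lies on.
        return tuple(
            sum(p * v for p, v in zip(plane, vec)) >= 0.0
            for plane in self.planes[table_no]
        )

    def add(self, doc_id, vec):
        self.docs[doc_id] = vec
        for t, table in enumerate(self.tables):
            table[self._key(t, vec)].append(doc_id)

    def query(self, vec, top_n=10):
        # Collect documents sharing a bucket with the query in any table,
        # then rank only those candidates by exact cosine similarity.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, vec), ()))
        scored = [(cosine(vec, self.docs[d]), d) for d in candidates]
        scored.sort(reverse=True)
        return scored[:top_n]
```

Usage would be along the lines of idx = HyperplaneIndex(dim=3), then
idx.add("a", [1.0, 0.0, 0.0]) for each document and idx.query(vec) for the
vector the UI builds. More tables raise recall at the cost of memory; more
bits per table shrink the buckets. This is approximate, so near neighbours
can be missed.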



-- 
Lance Norskog
goks...@gmail.com
