Look at the MoreLikeThis feature in Lucene. I believe it does roughly what you describe.
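
As a rough illustration of the multiple-hash idea raised in the quoted question, here is a minimal sketch of random-hyperplane hashing for cosine similarity in plain Python. It is an assumption about what such an approach could look like, not anything taken from Mahout, Lucene, or Solr; the names `HyperplaneLSH`, `cosine`, `tables`, and `bits` are hypothetical. Each table hashes a vector to the sign pattern it makes against a few random hyperplanes, and the exact cosine is computed only for documents that share a bucket with the query in at least one table.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class HyperplaneLSH:
    """Several hash tables, each keyed by the sign pattern of a vector
    against a few random hyperplanes (hypothetical helper, not a Mahout API)."""

    def __init__(self, vocab, tables=8, bits=6, seed=42):
        rng = random.Random(seed)
        # planes[table][bit] maps each vocabulary term to a Gaussian coefficient.
        self.planes = [
            [{term: rng.gauss(0.0, 1.0) for term in vocab} for _ in range(bits)]
            for _ in range(tables)
        ]
        self.buckets = [defaultdict(set) for _ in range(tables)]
        self.docs = {}

    def _key(self, table, vec):
        # One boolean per hyperplane: which side of the plane the vector is on.
        return tuple(
            sum(w * plane.get(t, 0.0) for t, w in vec.items()) >= 0.0
            for plane in self.planes[table]
        )

    def add(self, doc_id, vec):
        self.docs[doc_id] = vec
        for table in range(len(self.planes)):
            self.buckets[table][self._key(table, vec)].add(doc_id)

    def query(self, vec, top_n=10):
        # Candidates are documents sharing a bucket with the query in any table.
        candidates = set()
        for table in range(len(self.planes)):
            candidates |= self.buckets[table].get(self._key(table, vec), set())
        # Exact cosine is computed only for the candidates, best first.
        scored = [(cosine(vec, self.docs[d]), d) for d in candidates]
        return sorted(scored, reverse=True)[:top_n]
```

More tables raise the chance that a truly similar document lands in a shared bucket, and more bits per table shrink the buckets, so the two settings trade recall against candidate-set size. A document identical to the query always hashes to the same buckets, so it is always among the candidates.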

On Sat, Mar 10, 2012 at 9:58 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> I have a case where I'd like to get documents which most closely match a
> particular vector. The RowSimilarityJob of Mahout is ideal for
> precalculating similarity between existing documents but in my case the
> query is constructed at run time. So the UI constructs a vector to be used
> as a query. We have this running in prototype using a run time calculation
> of cosine similarity but the implementation is not scalable to large doc
> stores.
>
> One thought is to calculate fairly small clusters. The UI will know which
> cluster to target for the vector query. So we might be able to narrow down
> the number of docs per query to a reasonable size.
>
> It seems like a place for multiple hash functions maybe? Could we use some
> kind of hack of the boost feature of Solr or some other approach?
>
> Does anyone have a suggestion?

--
Lance Norskog
goks...@gmail.com