On Sun, Nov 13, 2011 at 10:18 PM, Jake Mannix <[email protected]> wrote:
> On Sun, Nov 13, 2011 at 10:09 PM, Ted Dunning <[email protected]> wrote:
>
> > That handles coherent.
> >
> > It doesn't handle usable.
> >
> > Storing the vectors as binary payloads handles the situation for
> > projection-like applications, but that doesn't help retrieval.
>
> It's not just projection, it's for added relevance: if you are already
> doing Lucene for your scoring needs, you are already getting some good
> precision and recall.
>
> The idea is this: you take results you are *already* scoring, and add to
> that scoring function an LSI cosine as one feature among many. Hopefully
> it will improve precision, even if it will do nothing for recall (as it's
> only being applied to results already retrieved by the text query).

I have done this with Lucene (some time ago) and had a hell of a time
getting decent performance when I wanted to rescore a thousand documents
from a disk-based index. That implies a memory-based system again.

The cost of a thousand or so rescores is probably about a millisecond or
so. Since each vector is only a few cache lines in size, the achievable
memory bandwidth should be significant.

> Alternatively, to improve recall, at index time, supplement each document
> with terms in a new field "lsi_expanded" which are the terms closest in
> the SVD-projected space to the document, but aren't already in it. Then
> at query time, add an "... OR lsi_expanded:<query>" clause onto your
> query. Instant query-expansion for recall enhancement.

This is actually pretty tricky to do well.
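
To make the rescoring idea quoted above concrete, here is a minimal,
hypothetical Java sketch. It assumes the k-dimensional SVD projections of
the documents are already held in memory (a Map from docId to float[]) and
that the query has been projected with the same basis; the class and field
names are made up for illustration. It simply adds a weighted LSI cosine
to whatever score Lucene already produced for each retrieved hit. At, say,
100 floats per vector (about 400 bytes, a handful of cache lines), a
thousand rescores touch well under half a megabyte of vector data, which
squares with the millisecond estimate above.

import java.util.Map;

// Hypothetical sketch: blend an existing Lucene score for an already
// retrieved hit with an LSI cosine computed from in-memory projected
// vectors. Names (LsiRescorer, lsiWeight, docVectors) are illustrative.
public class LsiRescorer {

  private final Map<Integer, float[]> docVectors; // docId -> k-dim projection
  private final double lsiWeight;                 // blend weight, tuned offline

  public LsiRescorer(Map<Integer, float[]> docVectors, double lsiWeight) {
    this.docVectors = docVectors;
    this.lsiWeight = lsiWeight;
  }

  /** Original Lucene score plus a weighted LSI cosine for one hit. */
  public double rescore(int docId, double luceneScore, float[] queryVector) {
    float[] docVector = docVectors.get(docId);
    if (docVector == null) {
      return luceneScore;   // no projection available, keep the text score
    }
    return luceneScore + lsiWeight * cosine(queryVector, docVector);
  }

  private static double cosine(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
  }
}

The lsiWeight blend factor would of course have to be tuned against
whatever other features the scoring function already uses.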

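For the "lsi_expanded" idea, here is a rough, purely illustrative sketch of
the selection step: for each document, pick the terms whose SVD-space
vectors are closest to the document's projection but which do not already
occur in it. vocabulary, termVectors, and maxTerms are assumed inputs
(term strings plus their k-dimensional vectors); the returned string would
then be indexed, analyzed and unstored, in a separate "lsi_expanded" field,
and the query would get an "... OR lsi_expanded:<query>" clause as
described above.

import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Set;

// Hypothetical sketch of index-time expansion: choose up to maxTerms terms
// that are nearest to the document in the SVD-projected space but absent
// from the document itself. Names here are illustrative, not Mahout APIs.
public class LsiExpander {

  private final String[] vocabulary;    // term strings, aligned with termVectors
  private final float[][] termVectors;  // one k-dimensional row per term

  public LsiExpander(String[] vocabulary, float[][] termVectors) {
    this.vocabulary = vocabulary;
    this.termVectors = termVectors;
  }

  /** Space-separated expansion terms for one document's projected vector. */
  public String expand(float[] docVector, Set<String> termsInDoc, int maxTerms) {
    // small min-heap of {term index, similarity}, keeping the best candidates
    PriorityQueue<double[]> best = new PriorityQueue<double[]>(
        Math.max(1, maxTerms), new Comparator<double[]>() {
          public int compare(double[] a, double[] b) {
            return Double.compare(a[1], b[1]);
          }
        });
    for (int t = 0; t < vocabulary.length; t++) {
      if (termsInDoc.contains(vocabulary[t])) {
        continue;                        // only add terms the document lacks
      }
      best.add(new double[] {t, cosine(docVector, termVectors[t])});
      if (best.size() > maxTerms) {
        best.poll();                     // drop the current worst candidate
      }
    }
    StringBuilder expansion = new StringBuilder();
    for (double[] entry : best) {
      if (expansion.length() > 0) {
        expansion.append(' ');
      }
      expansion.append(vocabulary[(int) entry[0]]);
    }
    return expansion.toString();
  }

  private static double cosine(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
  }
}

The parts that make this tricky to do well are not shown: presumably how
many terms to add per document, thresholding the similarity so marginal
terms do not pollute the index, and keeping the expansion field from
swamping the original text in the combined query.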