On Sun, Nov 13, 2011 at 10:31 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> I have done this with Lucene (some time ago) and had a hell of a time
> getting decent performance if I wanted to rescore a thousand documents
> from a disk based index. That implies a memory based system again. The
> cost of a thousand or so rescores is probably about a millisecond or so.
> Since each vector is roughly a few cache lines in size, the achievable
> memory bandwidth should be significant.
>
Yeah, I guess I tend to make the assumption that everyone is all in memory,
like I've been for the past 4 years or so. I have no idea what the current
Lucene cost of looking up additional binary payloads from disk is while in
the inner loop. I could totally believe it's prohibitive.

> > Alternatively, to improve recall, at index-time, supplement each
> > document by terms in a new field "lsi_expanded" which are the terms
> > closest in the SVD projected space to the document, but aren't already
> > in it. Then at query time, add an "... OR lsi_expanded:<query>" clause
> > onto your query. Instant query-expansion for recall enhancement.
>
> This actually is pretty tricky to make work well.

I never said it was necessarily a *good* idea to use LSI in this way (or,
in fact, to use LSI at all), just that if you *do* have a good scoring
model (like some kind of strongly predictive static prior, like PageRank),
then doing even fairly dumb recall-enhancing techniques can improve things
quite nicely, and "discretized" LSI like this is a "not completely dumb"
way to enhance recall.

  -jake
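P.S. A back-of-the-envelope sketch of the in-memory rescoring cost Ted
describes above (illustrative Python/numpy, not Lucene code; the vector
dimension of 64 float32s, i.e. a few cache lines per document, is an
assumption matching the sizes discussed):

```python
import time
import numpy as np

# ~1000 candidate documents, each a 64-dim float32 vector (256 bytes,
# a few cache lines), all resident in memory.
rng = np.random.default_rng(0)
doc_vectors = rng.standard_normal((1000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)

start = time.perf_counter()
scores = doc_vectors @ query           # one dot product per candidate
top = np.argsort(scores)[::-1][:10]    # keep the 10 best rescored docs
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"rescored 1000 docs in {elapsed_ms:.3f} ms")
```

With everything in memory the whole pass touches ~256 KB, so it is bound
by memory bandwidth rather than seeks, which is why a millisecond-scale
budget is plausible there and not from a disk-based index.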
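P.P.S. For concreteness, a toy sketch of the "lsi_expanded" idea quoted
above: truncate an SVD of the term-document matrix, then for each document
take the terms nearest to it in the reduced space that it doesn't already
contain. All names here are illustrative (this is numpy, not Lucene); the
tiny corpus is made up to show a synonym being pulled in via co-occurrence:

```python
import numpy as np

terms = ["cat", "feline", "dog", "canine", "car"]
# Toy term-document counts: rows = terms, columns = documents.
# doc 0 mentions only "cat"; doc 1 co-mentions "cat" and "feline",
# which is what lets LSI associate the two terms.
A = np.array([
    [1, 1, 0, 0],   # cat
    [0, 1, 0, 0],   # feline
    [0, 0, 1, 0],   # dog
    [0, 0, 1, 0],   # canine
    [0, 0, 0, 1],   # car
], dtype=float)

k = 2  # rank of the truncated SVD
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vecs = U[:, :k] * s[:k]   # terms in the reduced space
doc_vecs = Vt[:k].T            # documents in the reduced space

def lsi_expanded(doc_idx, n=1):
    """The n terms closest to the document in LSI space
    that do not already occur in the document."""
    sims = term_vecs @ doc_vecs[doc_idx]
    present = set(np.nonzero(A[:, doc_idx])[0])
    ranked = [i for i in np.argsort(sims)[::-1] if i not in present]
    return [terms[i] for i in ranked[:n]]

print(lsi_expanded(0))  # → ['feline']: doc 0 only contains "cat"
```

At index time those returned terms would go into the extra "lsi_expanded"
field, and the query-time "OR lsi_expanded:<query>" clause then matches
documents that never contained the literal query term — which is exactly
where the "tricky to make work well" part lives, since the cutoff for
"closest" terms controls how much noise you admit for the recall gain.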