On Sun, Nov 13, 2011 at 10:18 PM, Jake Mannix <[email protected]> wrote:

> On Sun, Nov 13, 2011 at 10:09 PM, Ted Dunning <[email protected]>
> wrote:
>
> > That handles "coherent".
> >
> > It doesn't handle "usable".
> >
> > Storing the vectors as binary payloads handles the situation for
> > projection-like applications, but that doesn't help retrieval.
> >
>
> It's not just projection, it's for added relevance: if you are already
> doing Lucene for your scoring needs, you are already getting some good
> precision and recall.
>
> The idea is this: you take results you are *already* scoring, and add to
> that scoring function an LSI cosine as one feature among many.  Hopefully
> it will improve precision, even if it does nothing for recall (as it's
> only being applied to results already retrieved by the text query).
>

I have done this with Lucene (some time ago) and had a hell of a time
getting decent performance when rescoring a thousand documents from a
disk-based index.  That implies a memory-based system again.  With the
vectors in memory, the cost of a thousand or so rescores should be roughly
a millisecond.  Since each vector is only a few cache lines in size, the
achievable memory bandwidth should be significant.
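
For concreteness, an in-memory rescore along these lines (a sketch only;
docVectors, lsiWeight, and LsiRescorer are made-up names, and it assumes the
SVD-projected document vectors are already held in a map keyed by Lucene
docId, with the query projected into the same reduced-rank space):

  import java.util.Arrays;
  import java.util.Comparator;
  import java.util.Map;

  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;

  // Illustrative only: blend an LSI cosine into scores of hits already
  // retrieved by the text query.
  public class LsiRescorer {

    public static ScoreDoc[] rescore(TopDocs hits,
                                     Map<Integer, float[]> docVectors,
                                     float[] queryVector, float lsiWeight) {
      ScoreDoc[] rescored = new ScoreDoc[hits.scoreDocs.length];
      for (int i = 0; i < rescored.length; i++) {
        ScoreDoc sd = hits.scoreDocs[i];
        float[] docVec = docVectors.get(sd.doc);
        // the LSI cosine is just one extra feature on top of the text score
        float bonus = (docVec == null)
            ? 0f : lsiWeight * cosine(queryVector, docVec);
        rescored[i] = new ScoreDoc(sd.doc, sd.score + bonus);
      }
      Arrays.sort(rescored, new Comparator<ScoreDoc>() {
        public int compare(ScoreDoc a, ScoreDoc b) {
          return Float.compare(b.score, a.score);  // descending blended score
        }
      });
      return rescored;
    }

    private static float cosine(float[] a, float[] b) {
      double dot = 0, na = 0, nb = 0;
      for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return (na == 0 || nb == 0) ? 0f : (float) (dot / Math.sqrt(na * nb));
    }
  }

TopDocs and ScoreDoc are plain Lucene classes; everything vector-related
stays in memory, which is the point.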


> Alternatively, to improve recall, at index time, supplement each document
> by terms in a new field "lsi_expanded" which are the terms closest in the
> SVD projected space to the document, but aren't already in it.  Then at
> query time, add an "... OR lsi_expanded:<query>" clause onto your query.
> Instant query-expansion for recall enhancement.
>

This is actually pretty tricky to make work well.
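
To make the index-time side concrete, a very rough sketch of picking the
expansion terms (illustrative only; termVectors, docTerms, and the k cutoff
are assumed inputs, and the chosen terms would still have to be analyzed and
written into the lsi_expanded field however your Lucene version expects):

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Set;

  // Illustrative only: pick the k vocabulary terms closest (by cosine) to a
  // document's SVD-projected vector that are NOT already in the document;
  // these would be indexed into a separate "lsi_expanded" field so an
  // "OR lsi_expanded:<query>" clause can pull the document back at query time.
  public class LsiExpansion {

    public static List<String> expansionTerms(float[] docVector,
                                              Set<String> docTerms,
                                              Map<String, float[]> termVectors,
                                              int k) {
      final Map<String, Float> sims = new HashMap<String, Float>();
      List<String> candidates = new ArrayList<String>();
      for (Map.Entry<String, float[]> e : termVectors.entrySet()) {
        if (!docTerms.contains(e.getKey())) {   // skip terms the doc already has
          sims.put(e.getKey(), cosine(docVector, e.getValue()));
          candidates.add(e.getKey());
        }
      }
      Collections.sort(candidates, new Comparator<String>() {
        public int compare(String a, String b) {
          return Float.compare(sims.get(b), sims.get(a)); // descending by cosine
        }
      });
      return candidates.subList(0, Math.min(k, candidates.size()));
    }

    private static float cosine(float[] a, float[] b) {
      double dot = 0, na = 0, nb = 0;
      for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
      }
      return (na == 0 || nb == 0) ? 0f : (float) (dot / Math.sqrt(na * nb));
    }
  }

(The brute-force scan over the whole vocabulary here is obviously not
something you'd want to do per document at any real scale.)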
