We did something related for a recent project. Basically:

 - Build a Lucene index using transformed data
 - Build the search query using similar transformations
 - Then take the top N, and do a more expensive scoring calculation

In the end, after much tweaking, it worked well: it could handle 1000
queries/sec on a biggish AWS box by keeping everything in memory.
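
A rough sketch of that two-stage shape, in plain Python rather than Lucene. Everything here is a stand-in: the `transform` function, the tiny in-memory inverted index, and the Jaccard-based `expensive_score` are assumptions for illustration, not what the project actually used.

```python
import heapq
from collections import defaultdict


def transform(text):
    """Hypothetical shared transformation, applied to both docs and queries."""
    return text.lower().split()


def build_index(docs):
    """Inverted index (term -> set of doc ids); a stand-in for a Lucene index."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in transform(text):
            index[term].add(doc_id)
    return index


def cheap_candidates(index, query, n):
    """Stage 1: count matching query terms per doc and keep the top n."""
    counts = defaultdict(int)
    for term in set(transform(query)):
        for doc_id in index.get(term, ()):
            counts[doc_id] += 1
    return heapq.nlargest(n, counts, key=lambda d: (counts[d], d))


def expensive_score(query, text):
    """Stage 2: placeholder for the costlier scorer (here, Jaccard overlap)."""
    q, d = set(transform(query)), set(transform(text))
    return len(q & d) / len(q | d) if (q | d) else 0.0


def search(index, docs, query, n=10):
    """Retrieve the top n cheaply, then reorder them with the expensive score."""
    candidates = cheap_candidates(index, query, n)
    return sorted(candidates,
                  key=lambda d: expensive_score(query, docs[d]),
                  reverse=True)


docs = {"a": "lucene index search", "b": "lucene", "c": "cooking recipes"}
index = build_index(docs)
results = search(index, docs, "Lucene")
```

The point of the split is that the expensive scorer only ever sees N documents per query, so its cost stays bounded however large the index gets.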

-- Ken

On Nov 13, 2011, at 10:18pm, Jake Mannix wrote:

> On Sun, Nov 13, 2011 at 10:09 PM, Ted Dunning <[email protected]> wrote:
> 
>> That handles coherent.
>> 
>> It doesn't handle usable.
>> 
>> Storing the vectors as binary payloads handles the situation for
>> projection-like applications, but that doesn't help retrieval.
>> 
> 
> It's not just projection, it's for added relevance: if you are already doing
> Lucene for your scoring needs, you already are getting some good precision
> and recall.
> 
> The idea is this: you take results you are *already* scoring, and add to
> that scoring function an LSI cosine as one feature among many.  Hopefully
> it will improve precision, even if it will do nothing for recall (as it's
> only being applied to results already retrieved by the text query).
> 
> Alternatively, to improve recall, at index-time, supplement each document
> by terms in a new field "lsi_expanded" which are the terms closest in the
> SVD projected space to the document, but aren't already in it.  Then at
> query time, add an "... OR lsi_expanded:<query>" clause onto your query.
> Instant query-expansion for recall enhancement.
> 
> Or do both, and play with both your precision and recall.
> 
>  -jake
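
For reference, a minimal sketch of both of Jake's suggestions in Python. The linear blend and its `weight` parameter are assumptions (the thread only says the LSI cosine is "one feature among many"), the vectors are assumed to be already projected into the SVD space, and `expand_query` just builds a query string using the `lsi_expanded` field name from the message above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rescore(hits, query_vec, doc_vecs, weight=0.5):
    """Blend each hit's text score with an LSI cosine (assumed linear blend).

    hits: list of (doc_id, text_score) already retrieved by the text query,
    so this can change precision (ordering) but not recall.
    """
    blended = [(doc_id, score + weight * cosine(query_vec, doc_vecs[doc_id]))
               for doc_id, score in hits]
    return sorted(blended, key=lambda hit: hit[1], reverse=True)


def expand_query(query, field="lsi_expanded"):
    """Add an OR clause against the expansion field to widen recall."""
    return f"({query}) OR {field}:({query})"


hits = [("d1", 1.0), ("d2", 0.9)]
doc_vecs = {"d1": [0.0, 1.0], "d2": [1.0, 0.0]}
reranked = rescore(hits, [1.0, 0.0], doc_vecs)
expanded = expand_query("mahout svd")
```

In a real setup the document vectors would come from wherever they are stored (for example the binary payloads Ted mentions), and the blend would more likely be tuned or learned than fixed at 0.5.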

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr