Hi Dennis,

You should check out payloads (arbitrary byte[] values stored with individual 
term occurrences in the index), which can be used to encode values that are 
then incorporated into documents' scores by overriding Similarity.scorePayload():

<http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/search/Similarity.html#scorePayload%28int,%20java.lang.String,%20int,%20int,%20byte[],%20int,%20int%29>
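To make the idea concrete, here is a rough stand-alone sketch (plain Java, no Lucene on the classpath) of how a category confidence could be packed into a 4-byte payload and read back at scoring time. The class and method names (ConfidencePayload, encode, decode, payloadFactor) are made up for illustration and are not Lucene API. In a real index you would attach the bytes to the category term at analysis time and do the decoding inside your scorePayload() override. Lucene's PayloadHelper (in org.apache.lucene.analysis.payloads) has encodeFloat()/decodeFloat() helpers that should do the same job.

```java
import java.util.Arrays;

/** Stand-alone sketch: pack a category confidence into a 4-byte payload and read it back. */
final class ConfidencePayload {

    /** Encode a float as 4 big-endian bytes holding its IEEE 754 bit pattern. */
    static byte[] encode(float confidence) {
        int bits = Float.floatToIntBits(confidence);
        return new byte[] {
            (byte) (bits >>> 24),
            (byte) (bits >>> 16),
            (byte) (bits >>> 8),
            (byte) bits
        };
    }

    /** Decode 4 big-endian bytes, starting at offset, back into a float. */
    static float decode(byte[] payload, int offset) {
        int bits = ((payload[offset] & 0xFF) << 24)
                 | ((payload[offset + 1] & 0xFF) << 16)
                 | ((payload[offset + 2] & 0xFF) << 8)
                 | (payload[offset + 3] & 0xFF);
        return Float.intBitsToFloat(bits);
    }

    /**
     * Hypothetical stand-in for the body of an overridden scorePayload():
     * returns the stored confidence, or a neutral factor of 1.0f when the
     * term occurrence carries no usable payload.
     */
    static float payloadFactor(byte[] payload, int offset, int length) {
        if (payload == null || length < 4) {
            return 1.0f;
        }
        return decode(payload, offset);
    }

    public static void main(String[] args) {
        byte[] stored = encode(0.3f); // e.g. the confidence for cat=boys
        System.out.println(Arrays.toString(stored) + " -> "
                + payloadFactor(stored, 0, stored.length));
    }
}
```

Since all four bytes of the float are stored, the value round-trips without the precision loss you get from the single-byte boost encoding, at the cost of four bytes per category term occurrence.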

The Lucene in Action 2 MEAP has a nice introduction to using payloads to 
influence scoring, in section 6.5.

See also this (slightly out-of-date*) blog post "Getting Started with Payloads" 
by Grant Ingersoll at Lucid Imagination:

<http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/>

*Note that since this blog post was written, BoostingTermQuery was renamed to 
PayloadTermQuery (in Lucene 2.9.0+ ; see 
http://issues.apache.org/jira/browse/LUCENE-1827 ; wow - this issue isn't 
mentioned in CHANGES.txt???):

<http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/search/payloads/PayloadTermQuery.html>

Steve

On 01/28/2010 at 6:01 AM, Dennis Hendriksen wrote:
> I'm struggling to create a performant query in Lucene 3.0.0 in which I
> want to combine 'regular' scoring with scores derived from external
> sources.
> 
> For each document a fixed set of scores is calculated in the range [0.0,
> 1.0>. These scores represent the confidences that a document falls into
> categories. So for example document #1 has a score of 0.3 for cat=boys,
> 0.2 for cat=girls, 0.1 for cat=toys, 0.05 for cat=animals.
> 
> The 'regular' scoring is calculated using a BooleanQuery with TermQuerys
> similar to: -type:H +(title:dna body:dna^1.5)
> 
> In the current naive approach I'm combining the scores as follows:
> - For each document, store the three best categories in the following
> fields:
> name=cat1st value=boys fieldboost=0.3
> name=cat2nd value=girls fieldboost=0.2
> name=cat3rd value=toys fieldboost=0.1
> - At search time, use the following query if you're interested in 'girls':
> -type:H +(title:dna body:dna^1.5) cat1st:girls cat2nd:girls cat3rd:girls 
> or if you're interested in 'boys': 
> -type:H +(title:dna body:dna^1.5) cat1st:boys cat2nd:boys cat3rd:boys
> 
> Disadvantages of the current approach:
> - loss of precision encoding/decoding boosts (performance is important,
> so this might be acceptable)
> - using TermQuery for the cat fields doesn't make a lot of sense since
> the external scores are multiplied by the idf of 'boys'/'girls' and
> the querynorm
> - the resulting score from the cat field is added to the other query
> score instead of multiplied
> 
> Just to give you an idea: the index I'm using is growing in time and
> contains about 50 million documents
> 
> Do you have an idea how I can improve my query and still keep high
> performance? Or should I combine the scores in the Collector (but this
> doesn't seem the right place to retrieve the category scores from the
> index)? Is it possible to use a different float->byte encoder per field
> to reduce the loss of precision?
> 
> Thanks for your time,
> Dennis

