RE: Custom lucene scoring - Dot product between field boost and query boost

Yuval Kesten Thu, 23 Feb 2012 04:21:42 -0800

One important thing - 
Since the weight is now stored as the value of the field, I no longer use the 
norms of the indexed fields, so I am now indexing the fields using:
Field field = new Field(field_name, Float.toString(weight), Store.YES, 
        Index.NOT_ANALYZED_NO_NORMS);
And the memory usage is back to normal... So cool!
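
For anyone wondering why dropping the norms makes such a difference: as far as I understand, Lucene 3.x keeps one norm byte per document for every field that has norms enabled, whether or not a given document actually contains that field. With a huge number of distinct field names that adds up fast. Below is a rough back-of-the-envelope sketch of that arithmetic only. The class name, the method name and the numbers are made up for illustration, and it ignores every other source of memory use.

```java
public class NormsMemoryEstimate {

    // Assumption: one norm byte per document per field that has norms enabled.
    static long estimateNormsBytes(long maxDoc, long fieldsWithNorms) {
        return maxDoc * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long maxDoc = 1_000_000L;      // hypothetical number of documents
        long distinctFields = 5_000L;  // hypothetical number of distinct field names
        long bytes = estimateNormsBytes(maxDoc, distinctFields);
        System.out.println("Approximate norms memory: "
                + (bytes / (1024L * 1024L)) + " MB");
    }
}
```

With NOT_ANALYZED_NO_NORMS the number of fields with norms drops to zero, so this term disappears from the estimate.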


-----Original Message-----
From: Yuval Kesten [mailto:ykes...@yahoo-inc.com] 
Sent: Wednesday, February 22, 2012 7:29 PM
To: java-user@lucene.apache.org
Subject: RE: Custom lucene scoring - Dot product between field boost and query 
boost

Hi all,
Inspired by another thread here (Question about CustomScoreQuery), I am now using 
the following solution, which works really well (with one drawback).
I discovered that some of my problems came from a wrong assumption: 
I did in fact have many fields/query terms with the same field ID.
This ruined my approach because the query boosts were aggregated and my 
calculations came out wrong.

During indexing, I now append the field value to the field ID (joined by '_') 
to form the field name, and use the desired score as the field value.

At search time I use a plain FieldScoreQuery (as-is, no modifications 
needed) with the composite field name.
I can still use setBoost to set the query-side score, because now my fields are 
unique.

Logic wise this is perfect - dot product using Lucene.
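
To make the scheme concrete, here is a minimal plain-Java sketch of the arithmetic only, with no Lucene classes involved. It treats each "fieldId_fieldValue" string as a unique key with the weight as its value, and uses the example query and documents from my original question further down the thread. The class and method names are hypothetical; in the real setup Lucene does the summing across the query clauses.

```java
import java.util.HashMap;
import java.util.Map;

public class CompositeFieldDotProduct {

    // Builds the unique key: field id and field value joined by '_'.
    static String key(String fieldId, String fieldValue) {
        return fieldId + "_" + fieldValue;
    }

    // Sums queryWeight * docWeight over the keys present in both maps.
    static double dotProduct(Map<String, Float> query, Map<String, Float> doc) {
        double score = 0.0;
        for (Map.Entry<String, Float> e : query.entrySet()) {
            Float docWeight = doc.get(e.getKey());
            if (docWeight != null) {
                score += e.getValue() * docWeight;
            }
        }
        return score;
    }

    static Map<String, Float> exampleQuery() {
        Map<String, Float> q = new HashMap<String, Float>();
        q.put(key("1", "AA"), 0.1f);
        q.put(key("7", "BB"), 0.2f);
        q.put(key("8", "CC"), 0.3f);
        return q;
    }

    static Map<String, Float> exampleDoc1() {
        Map<String, Float> d = new HashMap<String, Float>();
        d.put(key("1", "AA"), 0.2f);
        d.put(key("2", "DD"), 0.8f);
        d.put(key("7", "CC"), 0.999f);
        d.put(key("10", "FFF"), 0.1f);
        return d;
    }

    static Map<String, Float> exampleDoc2() {
        Map<String, Float> d = new HashMap<String, Float>();
        d.put(key("7", "BB"), 0.3f);
        d.put(key("8", "CC"), 0.5f);
        return d;
    }

    public static void main(String[] args) {
        System.out.println("Score(q,d1) = " + dotProduct(exampleQuery(), exampleDoc1()));
        System.out.println("Score(q,d2) = " + dotProduct(exampleQuery(), exampleDoc2()));
    }
}
```

Up to float rounding, the two scores come out as roughly 0.02 and 0.21. Field 7 of document 1 contributes nothing because its key is "7_CC" while the query asks for "7_BB".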

Drawback - lots and lots of distinct field names, which affects the memory usage 
dramatically.

If anyone has better ideas - please share!

-----Original Message-----
From: Alan Woodward [mailto:alan.woodw...@romseysoftware.co.uk]
Sent: Wednesday, February 22, 2012 4:00 PM
To: java-user@lucene.apache.org
Subject: Re: Custom lucene scoring - Dot product between field boost and query 
boost

Hi Yuval,

You can just override Similarity, rather than DefaultSimilarity - that way you 
don't burn any CPU cycles on TF/IDF calculations.

Alan

On 22 Feb 2012, at 07:17, Yuval Kesten wrote:

> Hi Em,
> 1. Regarding performance - the Similarity class (and my subclass as well) 
> receives the IDF, TF and sum-of-squared-weights values as inputs - it just 
> factors them differently. Even though I ignore the values, they are still 
> being computed.
> 2. I have written this code:
>    static {
>        Similarity.setDefault(new MySimilarity());
>    }
> Which means that I am setting the default similarity before doing the 
> indexing and obviously before the searching.
> Thanks!
> 
> -----Original Message-----
> From: Em [mailto:mailformailingli...@yahoo.de]
> Sent: Tuesday, February 21, 2012 6:07 PM
> To: java-user@lucene.apache.org
> Subject: Re: Custom lucene scoring - Dot product between field boost 
> and query boost
> 
> Hi Yuval,
> 
>> 1. Performance: I am calculating all the TF/IDF stuff and NORMS for 
>> nothing...
> You aren't calculating that much, since you declared all those values as 
> constants. What are you worried about?
> 
>> 2. The score I get from the TopScoreDocCollector is not the same as I
>> get from the Explanation.
>> Here is part of my code:
> Could you provide us the code where you are setting the Similarity, please?
> 
> Kind regards,
> Em
> 
> Am 21.02.2012 16:18, schrieb Yuval Kesten:
>> Hi,
>> I want to use Lucene with the following scoring logic:
>> When I index my documents I want to set for each field a score/weight.
>> When I query my index I want to set for each query term a score/weight.
>> 
>> I will NEVER index or query with multiple instances of the same field - in 
>> each query (or document) there will be 0-1 instances of a given field name.
>> My fields/query terms are not analyzed - each already consists of a single 
>> token.
>> 
>> I want the score to be simply the dot product between the fields of the 
>> query and the fields of the document, counting a field only if it has the 
>> same value in both.
>> 
>> For example:
>> Query:
>> Field Name   Field Value   Field Score
>> 1            AA            0.1
>> 7            BB            0.2
>> 8            CC            0.3
>> 
>> Document 1:
>> Field Name   Field Value   Field Score
>> 1            AA            0.2
>> 2            DD            0.8
>> 7            CC            0.999
>> 10           FFF           0.1
>> 
>> Document 2:
>> Field Name   Field Value   Field Score
>> 7            BB            0.3
>> 8            CC            0.5
>> 
>> 
>> The scores should be:
>> Score(q,d1) = FIELD_1_SCORE_Q * FIELD_1_SCORE_D1 = 0.1 * 0.2 = 0.02
>> (field 7 appears in both, but the values differ, so it contributes nothing)
>> Score(q,d2) = FIELD_7_SCORE_Q * FIELD_7_SCORE_D2 + FIELD_8_SCORE_Q *
>> FIELD_8_SCORE_D2 = (0.2 * 0.3) + (0.3 * 0.5) = 0.21
>> 
>> What would be the best way to implement it, in terms of accuracy and 
>> performance? (I don't need TF and IDF calculations.)
>> 
>> I currently implemented it by setting boosts on the fields and query terms.
>> Then I overrode the DefaultSimilarity class:
>> 
>> public class MySimilarity extends DefaultSimilarity {
>> 
>>    @Override
>>    public float computeNorm(String field, FieldInvertState state) {
>>        return state.getBoost();
>>    }
>> 
>>    @Override
>>    public float queryNorm(float sumOfSquaredWeights) {
>>        return 1;
>>    }
>> 
>>    @Override
>>    public float tf(float freq) {
>>        return 1;
>>    }
>> 
>>    @Override
>>    public float idf(int docFreq, int numDocs) {
>>        return 1;
>>    }
>> 
>>    @Override
>>    public float coord(int overlap, int maxOverlap) {
>>        return 1;
>>    }
>> 
>> }
>> 
>> And based on 
>> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/scoring.html 
>> this should work.
>> Problems:
>> 1. Performance: I am calculating all the TF/IDF stuff and NORMS for 
>> nothing...
>> 2. The score I get from the TopScoreDocCollector is not the same as I get 
>> from the Explanation.
>> Here is part of my code:
>> 
>> indexSearcher = new IndexSearcher(IndexReader.open(directory, true));
>> TopScoreDocCollector collector = TopScoreDocCollector.create(iTopN, true);
>> indexSearcher.search(query, collector);
>> ScoreDoc[] hits = collector.topDocs().scoreDocs;
>> for (int i = 0; i < hits.length; ++i) {
>>     int docId = hits[i].doc;
>>     Document d = indexSearcher.doc(docId);
>>     double score = hits[i].score;
>>     String id = d.get(FIELD_ID);
>>     Explanation explanation = indexSearcher.explain(query, docId);
>> }
>> 
>> Thanks!
>> 
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 