Scoring woes?

karl.wright Wed, 26 Jan 2011 09:18:01 -0800

I have an interesting scoring problem, which I can't seem to get around.

The problem is best stated as follows:


(1)    My schema has several independent fields, e.g. "value_0", "value_1", ... 
"value_6".

(2)    Every document has all of these fields set, with a-priori field norm 
values.  Where a record has no field value, the document is indexed with a 
placeholder value ("_empty_"), whose field norm is the numerical average of all 
the a-priori field norms for that field.

(3)    My query takes a set of terms, builds a list of combinations of 
these, and ORs the combinations together.  For example:

Q=Lexington Massachusetts

Query:
(+value_0:Lexington +value_0:Massachusetts)
(+value_0:Lexington +value_1:Massachusetts)
(+value_1:Lexington +value_0:Massachusetts)
...
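
A minimal sketch of how such a combination list could be generated, assuming every term may be assigned to every field. The class and method names (FieldCombinationQuery, combinations) are hypothetical and are not part of Lucene; the output is plain query-syntax text, with the groups left as optional (OR) clauses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene class: enumerates every assignment of
// query terms to fields and renders each one as a group of required clauses.
final class FieldCombinationQuery {

    static List<String> combinations(List<String> terms, List<String> fields) {
        List<String> groups = new ArrayList<>();
        if (terms.isEmpty()) {
            return groups; // nothing to combine
        }
        int[] choice = new int[terms.size()]; // choice[i] = field index for term i
        build(terms, fields, choice, 0, groups);
        return groups;
    }

    private static void build(List<String> terms, List<String> fields,
                              int[] choice, int pos, List<String> out) {
        if (pos == terms.size()) {
            // Every term has a field: emit "(+field:term +field:term ...)".
            StringBuilder sb = new StringBuilder("(");
            for (int i = 0; i < terms.size(); i++) {
                if (i > 0) sb.append(' ');
                sb.append('+').append(fields.get(choice[i]))
                  .append(':').append(terms.get(i));
            }
            out.add(sb.append(')').toString());
            return;
        }
        for (int f = 0; f < fields.size(); f++) {
            choice[pos] = f;
            build(terms, fields, choice, pos + 1, out);
        }
    }

    public static void main(String[] args) {
        List<String> groups = combinations(
                List.of("Lexington", "Massachusetts"),
                List.of("value_0", "value_1"));
        // One group per line; groups without + or - are optional clauses.
        System.out.println(String.join("\n", groups));
    }
}
```

The number of groups grows as (number of fields) to the power of (number of terms), so with seven fields even a short query produces a long clause list.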

The tricky part comes when I try to explicitly add the "_empty_" matches.  I 
need to do this because I am trying to ensure that when, say, two values are 
matched, the record that has only those two values scores higher than all 
the records that have those two values plus a third one.  So, I tried this:

Query:
(+value_0:Lexington +value_0:Massachusetts +value_1:_empty_ +value_2:_empty_
+value_3:_empty_ +value_4:_empty_ etc.)
(+value_0:Lexington +value_1:Massachusetts +value_2:_empty_ etc.)
(+value_1:Lexington +value_0:Massachusetts +value_2:_empty_ etc.)
...
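
A rough sketch of the padding step, assuming each group is described by a term-to-field assignment and that every field not used by the assignment gets a required "_empty_" clause. EmptyPaddedGroup and group are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Lucene class: renders one group and requires the
// "_empty_" placeholder in every field that the assignment does not use.
final class EmptyPaddedGroup {
    static final String EMPTY = "_empty_";

    // termToField maps each query term to the field it must match in.
    static String group(Map<String, String> termToField, List<String> allFields) {
        List<String> clauses = new ArrayList<>();
        for (Map.Entry<String, String> e : termToField.entrySet()) {
            clauses.add("+" + e.getValue() + ":" + e.getKey());
        }
        for (String field : allFields) {
            if (!termToField.containsValue(field)) {
                clauses.add("+" + field + ":" + EMPTY);
            }
        }
        return "(" + String.join(" ", clauses) + ")";
    }

    public static void main(String[] args) {
        Map<String, String> assignment = new LinkedHashMap<>();
        assignment.put("Lexington", "value_0");
        assignment.put("Massachusetts", "value_1");
        System.out.println(group(assignment,
                List.of("value_0", "value_1", "value_2", "value_3")));
    }
}
```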

I also needed to be able to match any value, instead of _empty_, in each of 
the places where that occurred.  Including no clause for these fields clearly 
messed up the queryNorm, so I fixed that by including a MatchAllDocsQuery() 
for each missing field, thus ensuring that the number of query clauses was 
identical from clause to clause.
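
A sketch of that padding in query-syntax form, assuming "*:*" is used as the textual stand-in for MatchAllDocsQuery. MatchAllPadding, pad, and the requireEmpty flags are hypothetical names; the point is only that every group ends up with the same number of clauses.

```java
import java.util.List;

// Hypothetical helper, not a Lucene class: pads a group so that every unused
// field contributes exactly one clause, either "_empty_" or a match-all.
final class MatchAllPadding {
    static final String EMPTY = "_empty_";
    static final String MATCH_ALL = "*:*"; // assumed stand-in for MatchAllDocsQuery

    // requireEmpty[i] == true  -> unusedFields.get(i) must hold the placeholder.
    // requireEmpty[i] == false -> any value is accepted; a match-all clause
    //                             keeps the clause count fixed.
    static String pad(String matchedClauses, List<String> unusedFields,
                      boolean[] requireEmpty) {
        if (unusedFields.size() != requireEmpty.length) {
            throw new IllegalArgumentException("one flag per unused field expected");
        }
        StringBuilder sb = new StringBuilder("(").append(matchedClauses);
        for (int i = 0; i < requireEmpty.length; i++) {
            sb.append(" +");
            sb.append(requireEmpty[i] ? unusedFields.get(i) + ":" + EMPTY : MATCH_ALL);
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(pad("+value_0:Lexington +value_1:Massachusetts",
                List.of("value_2", "value_3"), new boolean[] {true, false}));
    }
}
```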

Nevertheless, I was still not seeing the shortest-match records being scored to 
the top.  So I tried to boost the _empty_ matches, like this:

(+value_0:Lexington +value_0:Massachusetts +value_1:_empty_^1000.0
+value_2:_empty_^1000.0 +value_3:_empty_^1000.0 +value_4:_empty_^1000.0
etc.)

That, surprisingly, did not change anything.  I suppose it must be because the 
boost is also figured into the query norm?  I'm trying another experiment now, 
reindexing with a pre-boosted field norm for _empty_ tokens.  But what I'd like 
to ask is, how exactly are you supposed to fix this problem in Lucene?  All I 
want is for the minimal complete match to be scored to the top.
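
One way to check the guess about the query norm is to work through the arithmetic. The sketch below assumes the classic TF-IDF formulation, in which each clause weight is idf times boost and the query norm is 1 / sqrt(sum of squared clause weights). The idf values are made up, and QueryNormSketch is a hypothetical class, not Lucene code.

```java
// Hypothetical arithmetic sketch, not Lucene code. Assumes:
//   clause weight = idf * boost
//   queryNorm     = 1 / sqrt(sum of squared clause weights)
final class QueryNormSketch {

    static double queryNorm(double[] idf, double[] boost) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < idf.length; i++) {
            double w = idf[i] * boost[i];
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    // Each clause weight after the shared query norm has been applied.
    static double[] normalizedWeights(double[] idf, double[] boost) {
        double norm = queryNorm(idf, boost);
        double[] out = new double[idf.length];
        for (int i = 0; i < idf.length; i++) {
            out[i] = idf[i] * boost[i] * norm;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] idf = {1.0, 1.0};      // made-up idf values
        double[] boost = {1.0, 1000.0}; // second clause plays the boosted _empty_ clause
        double[] w = normalizedWeights(idf, boost);
        System.out.println("queryNorm = " + queryNorm(idf, boost));
        System.out.println("boosted / unboosted = " + (w[1] / w[0]));
    }
}
```

Under these assumptions the norm is a single factor shared by every clause, so a large boost shrinks all normalized weights but leaves the ratio between boosted and unboosted clauses unchanged.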

Karl
