Interesting datapoint: After the reindexing, the following query returns the 
right results in the right order:

(+value_3:Lexington~0.877 +value_1:Massachusetts~0.877 +*:*^0.0 +*:*^0.0 
+*:*^0.0)
(+value_3:Lexington~0.877 +value_1:Massachusetts~0.877 +value_4:_empty_ 
+value_5:_empty_ +value_6:_empty_)

In this case, the score for the correct candidate is 3x the score for the 
second candidate, which is also really encouraging.

Unfortunately, the automatically generated query, which has fuzzy matches for 
each token and many more clauses, still does NOT score the correct answer to 
the top.  So either another clause scores the bad results higher, or the 
aggregation of scores across the Boolean OR is messing things up.  It would be 
great if I could change the scoring algorithm for that specific BooleanQuery to 
take the max score from the set of disjunction clauses, rather than computing 
what is essentially a mathematical average.  Any ideas how best to do that?  Is 
there already something around I can use?
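
To make the difference concrete, here is a minimal sketch in plain Java (no 
Lucene classes) that compares the two ways of combining clause scores.  It 
assumes the default combination is the sum of the matching clause scores 
multiplied by a coord factor (matching clauses / total clauses).  The class 
name, method names, and all score values are hypothetical and chosen only for 
illustration.

```java
class DisjunctionScoring {
    // Sum of matching clause scores times coord (matched / total).
    // A score of 0.0 stands for "this clause did not match".
    static double sumWithCoord(double[] clauseScores) {
        double sum = 0.0;
        int matched = 0;
        for (double s : clauseScores) {
            if (s > 0.0) {
                sum += s;
                matched++;
            }
        }
        if (clauseScores.length == 0) {
            return 0.0;
        }
        return sum * matched / clauseScores.length;
    }

    // Alternative: keep only the best-scoring clause.
    static double maxOnly(double[] clauseScores) {
        double best = 0.0;
        for (double s : clauseScores) {
            best = Math.max(best, s);
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical per-clause scores for two candidate documents.
        double[] exact = {0.9, 0.0, 0.0, 0.0};   // one strong clause match
        double[] partial = {0.4, 0.4, 0.4, 0.0}; // several weak clause matches
        System.out.println("sum*coord: " + sumWithCoord(exact) + " vs " + sumWithCoord(partial));
        System.out.println("max:       " + maxOnly(exact) + " vs " + maxOnly(partial));
    }
}
```

With these made-up numbers the summed score favors the document that matches 
several weak clauses, while the max favors the one with a single strong clause. 
Lucene's DisjunctionMaxQuery is documented as scoring by the maximum sub-query 
score plus a tie-breaker fraction of the others, so it may be a candidate here; 
whether it fixes this particular query is untested.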

Thanks!
Karl


From: Wright Karl (Nokia-MS/Boston)
Sent: Wednesday, January 26, 2011 12:17 PM
To: 'dev@lucene.apache.org'
Subject: Scoring woes?

I have an interesting scoring problem, which I can't seem to get around.

The problem is best stated as follows:

(1)    My schema has several independent fields, e.g. "value_0", "value_1", ... 
"value_6".

(2)    Every document has all of these fields set, with a-priori field norm 
values.  Where a record has no field value, the document is indexed with a 
placeholder value ("_empty_"), whose field norm is the numerical average of all 
the a-priori field norms for that field.

(3)    My query takes a set of terms, builds a list of combinations of 
these, and ORs these combinations together.  For example:

Q=Lexington Massachusetts

Query:
(+value_0:Lexington +value_0:Massachusetts)
(+value_0:Lexington +value_1:Massachusetts)
(+value_1:Lexington +value_0:Massachusetts)
...
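
The enumeration above can be sketched in plain Java as follows.  This only 
builds the query strings in the notation used in this message; the class and 
method names are hypothetical, and it is not the actual generator, which also 
adds fuzzy matches.

```java
import java.util.ArrayList;
import java.util.List;

class FieldCombinations {
    // Returns one conjunctive clause per assignment of terms to fields
    // value_0 .. value_(fieldCount-1). The last term varies fastest.
    static List<String> clauses(String[] terms, int fieldCount) {
        List<String> out = new ArrayList<>();
        int total = 1;
        for (int i = 0; i < terms.length; i++) {
            total *= fieldCount;
        }
        for (int code = 0; code < total; code++) {
            // Decode 'code' as a base-fieldCount number, one digit per term.
            int[] fields = new int[terms.length];
            int rest = code;
            for (int t = terms.length - 1; t >= 0; t--) {
                fields[t] = rest % fieldCount;
                rest /= fieldCount;
            }
            StringBuilder sb = new StringBuilder("(");
            for (int t = 0; t < terms.length; t++) {
                if (t > 0) {
                    sb.append(' ');
                }
                sb.append("+value_").append(fields[t]).append(':').append(terms[t]);
            }
            out.add(sb.append(')').toString());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String c : clauses(new String[] {"Lexington", "Massachusetts"}, 2)) {
            System.out.println(c);
        }
    }
}
```

The number of clauses grows as fieldCount raised to the number of terms, which 
is why the generated query ends up with so many clauses.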

The tricky part comes in when I try to explicitly add the "_empty_" matches.  I 
need to do this because I am trying to ensure that when, say, two values are 
matched, the record which has only those two values scores highest, ahead of 
all the records that have those two values and also a third one.  So, I tried 
this:

Query:
(+value_0:Lexington +value_0:Massachusetts +value_1:_empty_ +value_2:_empty_ 
+value_3:_empty_ +value_4:_empty_ etc.)
(+value_0:Lexington +value_1:Massachusetts +value_2:_empty_ etc.)
(+value_1:Lexington +value_0:Massachusetts +value_2:_empty_ etc.)
...

I also needed it to be possible to match any value, instead of _empty_, in each 
of the places where that occurred.  Including no clause for these fields 
clearly messed up the queryNorm, so I fixed that by including a 
MatchAllDocsQuery() for each missing field, thus ensuring that the number of 
query clauses was identical from clause to clause.
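
The padding step can be sketched in plain Java as below.  It only builds query 
strings; the class name, method name, and the filler parameter are 
hypothetical.  Passing "value_N:_empty_" style fillers gives the first variant, 
and passing "*:*^0.0" (the form a zero-boost match-all clause takes when the 
query is printed) gives the second.

```java
import java.util.ArrayList;
import java.util.List;

class ClausePadding {
    // termFields[i] is the index of the field that terms[i] is matched against.
    // Every field not used by a term gets a filler clause, so each combination
    // has at least fieldCount sub-clauses. If matchAll is true the filler is
    // "*:*^0.0", otherwise it is "value_N:_empty_".
    static String paddedClause(String[] terms, int[] termFields,
                               int fieldCount, boolean matchAll) {
        boolean[] used = new boolean[fieldCount];
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < terms.length; i++) {
            used[termFields[i]] = true;
            parts.add("+value_" + termFields[i] + ":" + terms[i]);
        }
        for (int f = 0; f < fieldCount; f++) {
            if (!used[f]) {
                parts.add(matchAll ? "+*:*^0.0" : "+value_" + f + ":_empty_");
            }
        }
        return "(" + String.join(" ", parts) + ")";
    }

    public static void main(String[] args) {
        String[] terms = {"Lexington", "Massachusetts"};
        int[] fields = {0, 1};
        System.out.println(paddedClause(terms, fields, 4, false));
        System.out.println(paddedClause(terms, fields, 4, true));
    }
}
```

Note that when two terms land in the same field, that combination has one more 
sub-clause than the others, so the counts are only identical if such cases are 
handled separately.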

Nevertheless, I was still not seeing the shortest-match records being scored to 
the top.  So I tried to boost the _empty_ matches, like this:

(+value_0:Lexington +value_0:Massachusetts +value_1:_empty_^1000.0 
+value_2:_empty_^1000.0 +value_3:_empty_^1000.0 +value_4:_empty_^1000.0 
etc.)

That, surprisingly, did not change anything.  I suppose it must be because the 
boost is also figured into the query norm?  I'm trying another experiment now, 
reindexing with a pre-boosted field norm for _empty_ tokens.  But what I'd like 
to ask is, how exactly are you supposed to fix this problem in Lucene?  All I 
want is for the minimal complete match to be scored to the top.
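
For reference, here is a small plain-Java sketch of how a boost interacts with 
the query norm.  It assumes the default similarity, where each term clause 
contributes (idf * boost)^2 to a sum of squared weights and 
queryNorm = 1 / sqrt(that sum).  The idf values are made up, the class and 
method names are hypothetical, and this is not a claim about what actually 
happened in the experiment above.

```java
class QueryNormSketch {
    // Returns idf[i] * boost[i] * queryNorm for each clause, where
    // queryNorm = 1 / sqrt(sum over clauses of (idf * boost)^2).
    static double[] normalizedWeights(double[] idf, double[] boost) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < idf.length; i++) {
            double w = idf[i] * boost[i];
            sumOfSquares += w * w;
        }
        double queryNorm = 1.0 / Math.sqrt(sumOfSquares);
        double[] out = new double[idf.length];
        for (int i = 0; i < idf.length; i++) {
            out[i] = idf[i] * boost[i] * queryNorm;
        }
        return out;
    }

    public static void main(String[] args) {
        // Two ordinary term clauses and two _empty_ clauses (hypothetical idf = 1).
        double[] idf = {1.0, 1.0, 1.0, 1.0};
        double[] plain = normalizedWeights(idf, new double[] {1, 1, 1, 1});
        double[] boosted = normalizedWeights(idf, new double[] {1, 1, 1000, 1000});
        for (int i = 0; i < idf.length; i++) {
            System.out.println("clause " + i + ": " + plain[i] + " -> " + boosted[i]);
        }
    }
}
```

Under this assumption the normalized weights always have a squared sum of 1, so 
a large boost cannot grow the boosted clauses much; it mostly shrinks the 
unboosted ones relative to them.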

Karl

