I have an interesting scoring problem, which I can't seem to get around.
The problem is best stated as follows:
(1) My schema has several independent fields, e.g. "value_0", "value_1", ...
"value_6".
(2) Every document has all of these fields set, with a-priori field norm
values. Where a record has no field value, the document is indexed with a
placeholder value ("_empty_"), whose field norm is the numerical average of all
the a-priori field norms for that field.
(3) My query takes a set of terms, builds a list of combinations of
these, and ORs these combinations together. For example:
Q=Lexington Massachusetts
Query:
(+value_0:Lexington +value_0:Massachusetts)
(+value_0:Lexington +value_1:Massachusetts)
(+value_1:Lexington +value_0:Massachusetts)
...
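A minimal sketch of how such combinations could be generated, assuming every
term may land in any field. The class and method names are made up for
illustration, and the code only produces query text, not Lucene Query objects:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: it only builds query text.
class FieldAssignmentQueries {

    // Returns one "(+value_i:term ...)" clause for every way of assigning
    // each term to one of the fields value_0 .. value_(numFields-1).
    static List<String> buildClauses(List<String> terms, int numFields) {
        List<String> clauses = new ArrayList<>();
        if (terms.isEmpty() || numFields <= 0) {
            return clauses;
        }
        int total = 1;
        for (int i = 0; i < terms.size(); i++) {
            total *= numFields; // numFields ^ terms.size() assignments
        }
        for (int code = 0; code < total; code++) {
            // Decode `code` as a base-numFields number; the last term
            // takes the least significant digit.
            int[] fieldOf = new int[terms.size()];
            int rest = code;
            for (int t = terms.size() - 1; t >= 0; t--) {
                fieldOf[t] = rest % numFields;
                rest /= numFields;
            }
            StringBuilder sb = new StringBuilder("(");
            for (int t = 0; t < terms.size(); t++) {
                if (t > 0) sb.append(' ');
                sb.append("+value_").append(fieldOf[t]).append(':').append(terms.get(t));
            }
            clauses.add(sb.append(')').toString());
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Two fields only, to keep the printed list short.
        for (String c : buildClauses(Arrays.asList("Lexington", "Massachusetts"), 2)) {
            System.out.println(c);
        }
    }
}
```

The top-level query is then the OR of all returned clauses, i.e. the clauses
joined with spaces under the default OR operator.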
The tricky part comes when I try to add the "_empty_" matches explicitly. I
need to do this because I am trying to ensure that when, say, two values are
matched, the record that has only those two values scores higher than all the
records that have those two values and also a third one. So I tried this:
Query:
(+value_0:Lexington +value_0:Massachusetts +value_1:_empty_ +value_2:_empty_
+value_3:_empty_ +value_4:_empty_ etc.)
(+value_0:Lexington +value_1:Massachusetts +value_2:_empty_ etc.)
(+value_1:Lexington +value_0:Massachusetts +value_2:_empty_ etc.)
...
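A minimal sketch of the padding step, assuming each term has already been
assigned to a field. The class name, method name, and the fieldOf array are
hypothetical, and the code again only produces query text:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: it only builds query text.
class EmptyPaddedClause {

    // Placeholder token indexed for fields that have no value.
    static final String EMPTY = "_empty_";

    // terms.get(i) is required in field value_<fieldOf[i]>. Every field in
    // 0 .. numFields-1 with no term assigned gets a required _empty_ clause.
    static String build(List<String> terms, int[] fieldOf, int numFields) {
        boolean[] used = new boolean[numFields];
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) sb.append(' ');
            sb.append("+value_").append(fieldOf[i]).append(':').append(terms.get(i));
            used[fieldOf[i]] = true;
        }
        for (int f = 0; f < numFields; f++) {
            if (!used[f]) {
                if (sb.length() > 1) sb.append(' ');
                sb.append("+value_").append(f).append(':').append(EMPTY);
            }
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // Lexington in value_0, Massachusetts in value_1, five fields in total.
        System.out.println(build(Arrays.asList("Lexington", "Massachusetts"),
                                 new int[] {0, 1}, 5));
    }
}
```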
I also needed each of the places where _empty_ occurred to be able to match any
value instead. Leaving out the clause for these fields clearly messed up the
queryNorm, so I fixed that by including a MatchAllDocsQuery() for each missing
field, thus ensuring that the number of query clauses was identical from
clause to clause.
Nevertheless, I was still not seeing the shortest-match records being scored to
the top. So I tried to boost the _empty_ matches, like this:
(+value_0:Lexington +value_0:Massachusetts +value_1:_empty_^1000.0
+value_2:_empty_^1000.0 +value_3:_empty_^1000.0 +value_4:_empty_^1000.0
etc.)
That, surprisingly, did not change anything. I suppose it must be because the
boost is also figured into the query norm? I'm trying another experiment now,
reindexing with a pre-boosted field norm for _empty_ tokens. But what I'd like
to ask is, how exactly are you supposed to fix this problem in Lucene? All I
want to see is the minimal complete match be scored to the top.
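For reference, this is the scoring formula as documented for Lucene's classic
TF-IDF similarity, assuming that is the similarity in effect here. The boost of
a term appears both in the per-term product and in the sum inside queryNorm:

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^2 \cdot
  \mathrm{boost}(t) \cdot \mathrm{norm}(t,d)

\mathrm{queryNorm}(q) = \frac{1}{\sqrt{\mathrm{sumOfSquaredWeights}}},
\qquad
\mathrm{sumOfSquaredWeights} = \mathrm{boost}(q)^2 \cdot
  \sum_{t \in q} \bigl(\mathrm{idf}(t) \cdot \mathrm{boost}(t)\bigr)^2
```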
Karl