Hello,
First, apologies for the odd subject line and for cross-posting, but this
question got no replies on the Solr user mailing list last week.
We index many languages and search over all of them at once, but boost the
language of the user's preference. To differentiate between stemmed and
unstemmed tokens we use KeywordRepeat and RemoveDuplicates, which works very
well.
However, we just stumbled over the following example: for q=australia the
term is not stemmed in English, but its suffix is removed by the Romanian
stemmer. This causes the Romanian results to be returned above the English
results, despite language boosting.
This is because the Romanian part of the query consists of both the stemmed
and the unstemmed version of the word, while the English part of the query is
just one clause per field (title, content, etc.). Thus the Romanian results
score roughly twice as high as the English results.
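
To make the clause count concrete, here is a toy sketch of the chain
(KeywordRepeat, then a stemmer, then RemoveDuplicates) for a single query
word. The two stemmer functions are hypothetical stand-ins, not the real
English or Romanian stemmers; they only mimic the behaviour described above
for "australia".

```python
def english_stem(word):
    # Assumption: the English stemmer leaves "australia" unchanged.
    return word


def romanian_stem(word):
    # Hypothetical stand-in: strip a trailing "a" to mimic suffix removal.
    return word[:-1] if word.endswith("a") else word


def analyze(word, stemmer, dedupe=True):
    """Return the terms one query word produces for one language field."""
    # KeywordRepeat: keep the original token and add a stemmable copy.
    terms = [word, stemmer(word)]
    if dedupe:
        # RemoveDuplicates: drop a copy identical to an earlier token
        # (both tokens of one word share the same position).
        terms = list(dict.fromkeys(terms))
    return terms


en_terms = analyze("australia", english_stem)
ro_terms = analyze("australia", romanian_stem)
en_terms_no_dedupe = analyze("australia", english_stem, dedupe=False)

print(len(en_terms), len(ro_terms), len(en_terms_no_dedupe))
```

With deduplication the English field yields one term and the Romanian field
two, so the Romanian side contributes twice as many scoring clauses. Without
deduplication the English field also yields two (identical) terms.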
Now, the cause is of course really obvious, but the 'solution' is not. To work
around the problem I removed the RemoveDuplicates filter so that I get two
clauses for English as well. It is really ugly, but it works. What I don't
understand is the debug output: it doesn't list two identical clauses, but
instead shows a doubled boost on the field. So instead of:
27.048403 = PayloadSpanQuery, product of:
  27.048403 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
    27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0), product of:
      7.4 = boost
      3.084852 = idf(docFreq=14539, docCount=317894)
      1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
        4.0 = phraseFreq=4.0
        0.3 = parameter k1
        0.5 = parameter b
        15.08689 = avgFieldLength
        24.0 = fieldLength
  1.0 = AveragePayloadFunction.docScore()
I now get:
54.096806 = PayloadSpanQuery, product of:
  54.096806 = weight(title_en:australia in 15850) [SchemaSimilarity], result of:
    54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0), product of:
      14.8 = boost
      3.084852 = idf(docFreq=14539, docCount=317894)
      1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
        4.0 = phraseFreq=4.0
        0.3 = parameter k1
        0.5 = parameter b
        15.08689 = avgFieldLength
        24.0 = fieldLength
  1.0 = AveragePayloadFunction.docScore()
So where I expected two clauses in the debug output, I get a single clause
with a doubled boost.
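
As a sanity check on the numbers, the sketch below recomputes the score from
the values printed in the debug output, using the tfNorm formula exactly as
it is quoted there. The function names are mine, for illustration only. It
shows that one clause with boost 14.8 scores the same as the sum of two
identical clauses with boost 7.4, which is at least consistent with the two
clauses having been folded into one. Whether that folding is intended is
exactly what I am asking.

```python
def tf_norm(freq, k1, b, field_length, avg_field_length):
    # Formula as quoted in the explain output.
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


def clause_score(boost, idf, tfn):
    # The explain output reports the score as the product of these three.
    return boost * idf * tfn


IDF = 3.084852  # idf(docFreq=14539, docCount=317894), taken from the output

tfn = tf_norm(freq=4.0, k1=0.3, b=0.5,
              field_length=24.0, avg_field_length=15.08689)

single = clause_score(7.4, IDF, tfn)    # one clause, original boost
doubled = clause_score(14.8, IDF, tfn)  # one clause, doubled boost
two_clauses = single + single           # two identical clauses, summed

print(tfn, single, doubled, two_clauses)
```

Up to floating-point rounding, the recomputed values match the 1.1848832,
27.048403 and 54.096806 shown in the two debug listings.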
The question is: is this the intended behaviour?
Also, are there any real solutions to this problem? Removing the
RemoveDuplicates filter feels like a hack.
Many thanks!
Markus