[jira] [Comment Edited] (SOLR-17757) TFIDFSimilarity scoring difference between version 5.5.4 and 8.9.0
[
https://issues.apache.org/jira/browse/SOLR-17757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056973#comment-18056973
]
Uwe Schindler edited comment on SOLR-17757 at 2/6/26 8:02 PM:
Hi, see here for an explanation of what happened and why the new code is not
buggy:
https://github.com/apache/lucene/issues/8422#issuecomment-1223722482
The people who hit this problem had their own similarity implementations that
did not correctly implement query normalization, so IDF was applied twice.
There was actually no change in the final scores, only in how they were
calculated. This caused issues only in incomplete implementations.
was (Author: thetaphi):
Hi, see here for an explanation of what happened and why the new code is not
buggy: #8422 (comment)
The people who hit this problem had their own similarity implementations that
did not correctly implement query normalization, so IDF was applied twice.
There was actually no change in the final scores, only in how they were
calculated. This caused issues only in incomplete implementations.
> TFIDFSimilarity scoring difference between version 5.5.4 and 8.9.0
> --
>
> Key: SOLR-17757
> URL: https://issues.apache.org/jira/browse/SOLR-17757
> Project: Solr
> Issue Type: Bug
> Components: search
> Reporter: parveen saini
> Priority: Critical
> Labels: TFIDF, similarity
> Attachments: image-2025-06-04-23-42-47-309.png,
> image-2025-06-04-23-42-47-382.png
>
> On migrating Solr from version 5.5.4 to 8.9.0, I noticed that TFIDFSimilarity
> scoring is different and results in a different overall score for the query.
> On digging deeper, I found that idf is factored twice in version 5.5.4, which
> is causing the issue. Is the change in version 8.9.0 intentional?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
[jira] [Comment Edited] (SOLR-17757) TFIDFSimilarity scoring difference between version 5.5.4 and 8.9.0
[
https://issues.apache.org/jira/browse/SOLR-17757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18056973#comment-18056973
]
Uwe Schindler edited comment on SOLR-17757 at 2/6/26 8:01 PM:
Hi, see here for an explanation of what happened and why the new code is not
buggy: #8422 (comment)
The people who hit this problem had their own similarity implementations that
did not correctly implement query normalization, so IDF was applied twice.
There was actually no change in the final scores, only in how they were
calculated. This caused issues only in incomplete implementations.
was (Author: thetaphi):
Hi, see here for an explanation of what happened and why the new code is not
buggy:
https://github.com/apache/lucene/issues/8422#issuecomment-1223722482
The people who hit this problem had implementations that did not correctly
implement query normalization, so IDF was applied twice.
[jira] [Comment Edited] (SOLR-17757) TFIDFSimilarity scoring difference between version 5.5.4 and 8.9.0
[
https://issues.apache.org/jira/browse/SOLR-17757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17955987#comment-17955987
]
Khaled Alkhouli edited comment on SOLR-17757 at 6/4/25 7:52 AM:
This is probably not a bug. Lucene 8 made a major change to the way scores are
calculated. If you're using BM25 (which builds on tf-idf), absolute scores will
be lower because Lucene removed a multiplication factor, (k1 + 1), from the
numerator of the BM25 formula. As per the documentation of the 8.0.0 release,
if you have not explicitly specified a {{similarityFactory}} in your schema, or
if you're using the default {{SchemaSimilarityFactory}}, then
{{LegacyBM25Similarity}} is selected automatically only if
{{luceneMatchVersion}} is set lower than 8.0.0. If your {{luceneMatchVersion}}
is 8.0.0 or higher, Solr uses the updated {{BM25Similarity}} by default, which
explains the new scoring behavior.
I didn't find any ticket or documentation showing that the idf is factored
twice in version 5.5.4. Could you share the source for that claim?
Refer to the following documentation for more details:
[https://solr.apache.org/guide/8_0/major-changes-in-solr-8.html]
For the technical background, see this ticket:
https://issues.apache.org/jira/browse/LUCENE-8563
You can also review the PR linked in that ticket for the exact code changes.
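To make the BM25 change described in this comment concrete, here is a minimal
Python sketch of a single-term BM25 score with and without the extra factor in
the numerator. It is an illustration, not Lucene's code: the function names,
the sample documents, and the statistics are hypothetical, and it uses exact
document lengths, whereas Lucene stores lengths in a lossy encoded form.

```python
import math

def bm25_idf(doc_freq, doc_count):
    # IDF in the usual BM25 form: ln(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(freq, doc_len, avg_doc_len, doc_freq, doc_count,
                    k1=1.2, b=0.75, legacy=False):
    """Score one term in one document.

    legacy=True keeps the (k1 + 1) factor in the numerator (older behavior);
    legacy=False drops it (behavior described for Lucene 8 and later).
    """
    idf = bm25_idf(doc_freq, doc_count)
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    score = idf * freq / (freq + length_norm)
    return score * (k1 + 1.0) if legacy else score

# Hypothetical corpus statistics and documents: name -> (term freq, doc length)
DOC_COUNT, DOC_FREQ, AVG_LEN = 1000, 10, 150.0
DOCS = {"doc_a": (3, 120), "doc_b": (1, 40), "doc_c": (5, 400)}

def rank(docs, legacy):
    # Return document names ordered from highest to lowest score.
    scores = {
        name: bm25_term_score(freq, length, AVG_LEN, DOC_FREQ, DOC_COUNT,
                              legacy=legacy)
        for name, (freq, length) in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    new = bm25_term_score(3, 120, AVG_LEN, DOC_FREQ, DOC_COUNT)
    old = bm25_term_score(3, 120, AVG_LEN, DOC_FREQ, DOC_COUNT, legacy=True)
    print("legacy / new ratio:", old / new)  # constant factor k1 + 1
    print("same ranking:", rank(DOCS, True) == rank(DOCS, False))
```

Under these assumptions every score shrinks by the same constant factor
(k1 + 1), so absolute values drop while the relative order of documents for a
BM25-only query stays the same.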
