RE: Re: Custom scores and sort
Hi Adrien,

Thank you for your reply. Here is a detailed example to clarify what I am trying to do.

The name of the “only once score” field is “onlyOnce” and its boost is 5. There are 2 documents:

1. doc1 contains the onlyOnce field twice, with the values “2” and “3”, plus some other fields
2. doc2 contains the onlyOnce field once, with the value “2”, plus some other fields

The SHOULD query = custom(onlyOnce:2) custom(onlyOnce:3)

The “onlyOnce” field must be counted only once per document. To this end, I pass my CustomScoreQuery subclass a map “doc ID to field name” as an argument (the doc ID is my own ID, not the Lucene doc):

1. doc1:
   * when the custom score of onlyOnce:2 is calculated, the pair doc1 ID to “onlyOnce” is added to the map and the returned subscore = 1
   * when the custom score of onlyOnce:3 is calculated, the map already contains the key doc1 ID with the value “onlyOnce”, so the returned subscore = 0
2. doc2:
   * when the custom score of onlyOnce:2 is calculated, the pair doc2 ID to “onlyOnce” is added to the map and the returned subscore = 1

Therefore doc1 and doc2 get the same final subscore for the “onlyOnce” field: subscore x boost = 1 x 5 = 5.

The TopFieldDocs search and the TopDocs search both return the same, correct final score. So far, all is OK for the final score.

However, there are other fields besides “onlyOnce”, and I use a TopFieldDocs search to sort by score first, then by a date field and then by a third field. The Lucene explanation shows that the TopFieldDocs search does not use the correct final score to sort: for doc1 as well as for doc2, it uses a score (fields[0]) in which the contribution of the “onlyOnce” field is 0 and not 5. I suspect the reason is that, in order to sort, it passes through the CustomScoreQuery subclass again while the map already contains the doc1 and doc2 pairs. As a result, a hit with a lower total final score can be ranked before a hit with a higher score.

The test with a TopDocs search returns the correct final score of 5, and the default sorting by relevance only is correct.

Why is fields[0], which is used to sort the TopFieldDocs hits, not the final score?

I agree with you: I must conclude that my CustomScoreQuery subclass breaks some Lucene assumptions.

As for your last question about the LongDistanceFeatureQuery: I don't know it, it is not in version 5 of Lucene, which I use.

Claude Lepère

From: Adrien Grand
Sent: Wednesday, March 23, 2022 17:58
To: Lucene Users Mailing List
Subject: Re: Re: Custom scores and sort

Sorry Claude, but I have some trouble following what you are doing with your CustomScoreQuery. It feels like your query is doing something that breaks some assumptions that Lucene makes.

Have you looked at existing ways that Lucene supports boosting documents by recency, such as putting a LongDistanceFeatureQuery as a SHOULD clause in a BooleanQuery?
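To illustrate why a score that depends on an earlier pass is fragile, here is a minimal plain-Java sketch. It is not Claude's actual code and uses no Lucene classes; the names OnlyOnceDemo, StatefulOnlyOnce and statelessScore are hypothetical. It assumes, as the thread suggests, that the scoring routine may be invoked more than once for the same document. Under that assumption the map-based scorer returns 1 on the first call and 0 on every later call, whereas a function that depends only on its inputs returns the same value however often it is called.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java simulation only; no Lucene classes are involved. */
class OnlyOnceDemo {

    /** Stateful scorer: mimics the "doc ID to field name" map described above. */
    static final class StatefulOnlyOnce {
        private final Map<String, String> seen = new HashMap<>();

        float score(String docId, String field) {
            // putIfAbsent returns null only the first time this docId is recorded,
            // so every later call for the same document yields 0.
            return seen.putIfAbsent(docId, field) == null ? 1f : 0f;
        }
    }

    /** Stateless variant: the result depends only on the argument. */
    static float statelessScore(boolean docMatchesField) {
        return docMatchesField ? 1f : 0f;
    }

    public static void main(String[] args) {
        float boost = 5f;
        StatefulOnlyOnce scorer = new StatefulOnlyOnce();

        // doc2 has a single "onlyOnce" field, but the scorer is called twice
        // (assumption: once to collect the hit, once more to compute the sort value).
        System.out.println("first pass  = " + scorer.score("doc2", "onlyOnce") * boost); // 5.0
        System.out.println("second pass = " + scorer.score("doc2", "onlyOnce") * boost); // 0.0

        // The stateless function gives the same answer on every call.
        System.out.println("stateless   = " + statelessScore(true) * boost);             // 5.0
        System.out.println("stateless   = " + statelessScore(true) * boost);             // 5.0
    }
}
```

If the sort value is taken from the second call, the stateful variant contributes 0 instead of 5, which matches the difference reported between score=... and fields[0].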
Re: Re: Custom scores and sort
Sorry Claude, but I have some trouble following what you are doing with your CustomScoreQuery. It feels like your query is doing something that breaks some assumptions that Lucene makes.

Have you looked at existing ways that Lucene supports boosting documents by recency, such as putting a LongDistanceFeatureQuery as a SHOULD clause in a BooleanQuery?

On Mon, Mar 14, 2022 at 7:00 PM Claude Lepere wrote:
>
> Adrien, thank you for your answer and sorry for the lack of clarity.
>
> No, the score of a document does not depend on the score of another
> document, the problem lies within a document.

--
Adrien

To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
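For readers unfamiliar with the recency boost Adrien mentions: in recent Lucene versions (not 5.x) it is, as far as I know, created with LongPoint.newDistanceFeatureQuery(field, weight, origin, pivotDistance), and its documented score is weight * pivotDistance / (pivotDistance + distance), where distance is how far the document's value is from origin. The plain-Java sketch below only reproduces that formula so the decay can be seen with numbers; it does not call Lucene, and the class name, method name, origin timestamp and 30-day pivot are hypothetical choices for illustration.

```java
/** Plain-Java rendering of the distance-feature scoring formula; no Lucene calls. */
class RecencyScoreDemo {

    /**
     * score = weight * pivotDistance / (pivotDistance + |value - origin|).
     * The distance is computed in double to sidestep long overflow in this sketch.
     */
    static float distanceFeatureScore(float weight, long origin, long pivotDistance, long value) {
        double distance = Math.abs((double) value - (double) origin);
        return (float) (weight * (pivotDistance / (pivotDistance + distance)));
    }

    public static void main(String[] args) {
        long dayMillis = 24L * 60 * 60 * 1000;
        long origin = 1_647_993_600_000L; // hypothetical "newest" date, in epoch millis
        long pivot = 30 * dayMillis;      // hypothetical pivot: score halves at 30 days
        float weight = 5f;

        // Document dated exactly at the origin: full weight.
        System.out.println(distanceFeatureScore(weight, origin, pivot, origin));             // 5.0
        // Document one pivot distance (30 days) older: half the weight.
        System.out.println(distanceFeatureScore(weight, origin, pivot, origin - pivot));     // 2.5
        // Document three pivot distances (90 days) older: a quarter of the weight.
        System.out.println(distanceFeatureScore(weight, origin, pivot, origin - 3 * pivot)); // 1.25
    }
}
```

Because the value depends only on the document's date and the query parameters, it is the same no matter how many times the document is scored.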
RE: Re: Custom scores and sort
Adrien, thank you for your answer and sorry for the lack of clarity.

No, the score of a document does not depend on the score of another document; the problem lies within a single document.

There are several "only once score" fields; to simplify, I will suppose there is only one. A document can contain this "only once score" field several times, with different values, and a query can contain several clauses on the different values of this field; these clauses can be SHOULD or MUST. For such a document, the score of this field should only be counted on the first pass through my CustomScoreQuery subclass; on subsequent passes, the custom score = 0. To do so, the constructor of the subclass takes as an argument the map "my document id (not the Lucene doc!) to the field".

Then, the score of the first pass is multiplied by a date factor which depends on the age of the document (age = maximum date of the query results - date of the document): the score of a document decreases with its age.

The total score (field + date) is correctly calculated, but the explanation log shows that the sort score (the first element of fields[]) is not the total score but the total score minus the "only once score" or, to put it another way, a total score where the "only once score" = 0. That is why a hit with a lower total score happens to be ranked before a hit with a higher total score.

The log of my CustomScoreQuery subclass shows that, even if the document contains only one "only once score" field, Lucene passes through the CustomScoreProvider's customScore method twice, so the score of the second pass = 0, and it seems to me that this value is retained for the sort score.

I did not find why a TopFieldDocs search (with Sort = SortField.FIELD_SCORE and date) uses the "diminished" score and not the total score, as TopDocs does.

Thanks in advance.
Claude Lepère

On 2022/03/14 12:59:45 Adrien Grand wrote:
> It's a bit hard for me to parse what you are trying to do, but it
> looks like you are making assumptions about how Lucene works
> internally that are not correct.
>
> Do I understand correctly that your scoring mechanism has dependencies
> on other documents, ie. the score of a document could depend on the
> score of other documents? This is something that Lucene doesn't
> support.
Re: Custom scores and sort
It's a bit hard for me to parse what you are trying to do, but it looks like you are making assumptions about how Lucene works internally that are not correct.

Do I understand correctly that your scoring mechanism has dependencies on other documents, i.e. the score of a document could depend on the score of other documents? This is something that Lucene doesn't support.

On Thu, Mar 10, 2022 at 12:23 PM Claude Lepere wrote:
>
> Hi.
> The problem is that, although the results are sorted by score, a match
> with a lower score is ranked before a match with a greater score.
> The origin of the problem lies in a subclass of CustomScoreQuery which
> calculates an "only once" score for each document: on the first pass the
> document gets its score and, if the document contains the same field
> several times, on the subsequent passes it gets 0.
> I wonder if it is possible for Lucene to give a score that depends on a
> previous pass in the CustomScoreProvider customScore routine for the same
> document.
> I ran 2 searches with IndexSearcher: the first one returns a TopDocs, which
> is sorted by relevance by default, and the second search - with the Sort
> array = [SortField.FIELD_SCORE, a date SortField] argument - returns a
> TopFieldDocs.
> The TopDocs results are sorted by the score with the first pass value of
> the only once method, while the TopFieldDocs results are sorted by the
> score with the value (= 0) of the next pass, hence the ranking errors.
> I did not find why the TopFieldDocs search does not use the score of the
> hit to sort, as the TopDocs search does.
> I did not find how to tell the TopFieldDocs search to use the hit score to
> sort.
>
> Claude Lepère

--
Adrien
RE: Custom scores and sort
Hi! I see where the problem lies but I can't find a way to solve it.

First feature: one of the fields must be scored only once. If a document matches this field several times (with different values), the score is counted only the first time. A map is given as an argument to the CustomScoreQuery; it registers that the document has been scored once, so that all subsequent matches result in a score of 0.

Second feature: another CustomScoreQuery multiplies each sub-score by a factor based on the date of the document. Document A, which matches better than document B but is older, may receive a lower final score than document B.

The calculation of the final total score (only once score field + date) gives the expected, correct result (the Explanation shows it), but in some cases - because of the date correction - the ranking is wrong: a document with a lower final total score is ranked before a document with a higher score.

In scoreDoc.toString(), the score=... part and the fields=[score, ...] part do not have the same score value; that of fields=[] is smaller. The difference is equal to the score of the "only once score" field multiplied by the date factor. This fields part represents the Sort requested from the IndexSearcher. The difference exists for all hits, whether the document has the "only once score" field once or more.

Why this difference? When debugging, I see that the IndexSearcher search enters the "only once score" CustomScoreQuery at least a second time, and that it is the 0 score that is finally retained, since the record that the score has already been given was made for each match.

I can't figure out how to solve this problem, and I'm not sure there is a solution, since a score depends on a previous score. I've tried the FunctionQuery route without success, but I'm not sure that technique applies here either.

Am I making a mistake somewhere? The only workaround I can see is re-sorting all the hits at the end, outside Lucene.
I would be very happy if someone could point me to a better solution.

Thanks in advance.

Claude Lepère
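The re-sorting workaround mentioned in this message could look roughly like the plain-Java sketch below. It is only an illustration under stated assumptions: Hit is a hypothetical holder for values that, with Lucene, would be copied from each hit (the score from ScoreDoc.score, the date from wherever the application keeps it), and the tie-break on ascending date is an assumption, since the thread does not say which direction the date sort uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Re-sorts hits outside Lucene by final score (descending), then by date. */
class ResortDemo {

    /** Hypothetical holder for the values taken from one search hit. */
    static final class Hit {
        final String id;
        final float score;     // the final hit score, i.e. the score=... value
        final long dateMillis; // document date as epoch millis

        Hit(String id, float score, long dateMillis) {
            this.id = id;
            this.score = score;
            this.dateMillis = dateMillis;
        }
    }

    /** Returns a new list; the input list is left untouched. */
    static List<Hit> resort(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) -> h.score)
                .reversed()                             // highest score first
                .thenComparingLong(h -> h.dateMillis)); // assumption: older date first on ties
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("A", 1.0f, 100L),
                new Hit("B", 2.0f, 200L),
                new Hit("C", 2.0f, 150L));
        for (Hit h : resort(hits)) {
            System.out.println(h.id + " score=" + h.score + " date=" + h.dateMillis);
        }
        // Order printed: C, B, A
    }
}
```

This only reorders the hits that were already collected; if the search keeps just the top N by the diminished sort value, a hit that should have been in the top N may already be missing, so it is a workaround rather than a fix.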
Re: Custom scores and sort
Hello Claude,

here is what I'm doing and it seems to work; I haven't yet created failure tests. Maybe a more expert member will have more information.

Date field inserted:

    final Date parse = DATE_FORMAT.parse(DATE_FORMAT.format(o1));
    new LongPoint(attributeName, parse.getTime());

The sorter:

    Sort sort = new Sort(SortField.FIELD_SCORE, new SortField(LAST_UPDATE, SortField.Type.STRING));

The query:

    TopDocs docs = searcher.search(q, maxCount, sort);

The records are inserted with a 1 sec delay (for test purposes only).

Stephane

-----Original Message-----
From: Claude Lepere
Reply-To: java-user@lucene.apache.org
To: java-user@lucene.apache.org
Subject: Custom scores and sort
Date: Mon, 21 Feb 2022 10:56:18 +0100

Hi! I have a question about sorting: I don’t understand why, in a test, a hit with a lower score is ranked before hits with higher scores.

I am using Lucene 5.2.1.

Two CustomScoreQuery subqueries on two fields, subquery 1 and subquery 2, and two test cases:

case 1: the two calculated custom scores are multiplied by the same factor, depending on the date of the match, at the end of the customScore method of CustomScoreProvider

case 2: the two calculated custom scores are *not* multiplied by the date factor.

All tests use the same Sort, by score then by date.

Case 1: with date factor:

Test 1: subquery 1 only:
two hits, doc A (date A) gets the score A1, doc B (date B) gets the score B1: score A1 > score B1, date A < date B, and doc A is ranked before doc B
Explanation:
doc A score A1 shardIndex=0 fields=[score A1, date A]
doc B score B1 shardIndex=0 fields=[score B1, date B]

That's correct.

Test 2: MUST query subquery 1, subquery 2:
the same two docs match: doc A (date A) gets the score A2, doc B (date B) gets the score B2: score A2 *<* score B2, date A < date B, and *doc A is ranked before doc B*
Explanation:
doc A score A2 shardIndex=0 fields=[score A1, date A]
doc B score B2 shardIndex=0 fields=[score B1, date B]

*doc A is ranked before doc B although score A2 < score B2 and sorting should use scores A2 and B2, not A1 and B1.*

Case 2: without date factor:

Test 1: subquery 1 only:
doc A (date A) gets the score A1, doc B (date B) gets the score B1: score A1 > score B1, date A < date B, and doc A is ranked before doc B
Explanation:
doc A score A1 shardIndex=0 fields=[score A1, date A]
doc B score B1 shardIndex=0 fields=[score B1, date B]

Test 2: MUST query subquery 1, subquery 2:
the same two docs match: doc A (date A) gets the score A2, doc B (date B) gets the score B2: score A2 *>* score B2, date A < date B, and doc A is ranked before doc B
Explanation:
doc A score A2 shardIndex=0 fields=[score A1, date A]
doc B score B2 shardIndex=0 fields=[score B1, date B]

Using score A1 here works: without the date factor, all the hits of test 2 match subquery 2 in the same way and they get the same sub-score. The explanation shows in this case that the score = field[0] score + the common sub-score of the hits; therefore the sorting is the same by current score as by field[0] score.

But with the date factor this is no longer true: the sort [Score, date] should use the current scores of test 2 and not those of test 1.

Please, could someone enlighten me? Am I making a mistake somewhere?

Claude Lepère