jpountz commented on code in PR #13470: URL: https://github.com/apache/lucene/pull/13470#discussion_r1631987192
########## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ########## @@ -350,4 +354,38 @@ private static TopDocs mergeAux( return new TopFieldDocs(totalHits, hits, sort.getSort()); } } + + /** Reciprocal Rank Fusion method. */ + public static TopDocs rrf(int TopN, int k, TopDocs[] hits) { + Map<Integer, Float> rrfScore = new HashMap<>(); + long minHits = Long.MAX_VALUE; + for (TopDocs topDoc : hits) { + minHits = Math.min(minHits, topDoc.totalHits.value); + Map<Integer, Float> scoreMap = new HashMap<>(); + for (ScoreDoc scoreDoc : topDoc.scoreDocs) { + scoreMap.put(scoreDoc.doc, scoreDoc.score); + } + + List<Map.Entry<Integer, Float>> scoreList = new ArrayList<>(scoreMap.entrySet()); + scoreList.sort(Map.Entry.comparingByValue()); Review Comment: We don't seem to be using these `scoreMap` and `scoreList` anywhere? ########## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ########## @@ -350,4 +354,38 @@ private static TopDocs mergeAux( return new TopFieldDocs(totalHits, hits, sort.getSort()); } } + + /** Reciprocal Rank Fusion method. */ + public static TopDocs rrf(int TopN, int k, TopDocs[] hits) { + Map<Integer, Float> rrfScore = new HashMap<>(); + long minHits = Long.MAX_VALUE; + for (TopDocs topDoc : hits) { + minHits = Math.min(minHits, topDoc.totalHits.value); + Map<Integer, Float> scoreMap = new HashMap<>(); + for (ScoreDoc scoreDoc : topDoc.scoreDocs) { + scoreMap.put(scoreDoc.doc, scoreDoc.score); + } + + List<Map.Entry<Integer, Float>> scoreList = new ArrayList<>(scoreMap.entrySet()); + scoreList.sort(Map.Entry.comparingByValue()); + + int rank = 1; + for (ScoreDoc scoreDoc : topDoc.scoreDocs) { + rrfScore.put(scoreDoc.doc, rrfScore.getOrDefault(scoreDoc.doc, 0.0f) + 1.0f / (rank + k)); + rank++; + } + } + + List<Map.Entry<Integer, Float>> rrfScoreRank = new ArrayList<>(rrfScore.entrySet()); + rrfScoreRank.sort( + Map.Entry.<Integer, Float>comparingByValue().reversed()); // Sort in descending order + + ScoreDoc[] rrfScoreDocs = new ScoreDoc[Math.min(TopN, rrfScoreRank.size())]; + for (int i = 0; i < rrfScoreDocs.length; i++) { + rrfScoreDocs[i] = new ScoreDoc(rrfScoreRank.get(i).getKey(), rrfScoreRank.get(i).getValue()); Review Comment: Nit: we should preserve the original `shardIndex` that is configured on the `ScoreDoc` object that identifies the shard that it is coming from. ########## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ########## @@ -350,4 +354,38 @@ private static TopDocs mergeAux( return new TopFieldDocs(totalHits, hits, sort.getSort()); } } + + /** Reciprocal Rank Fusion method. */ Review Comment: Users could use more details in these javadocs, e.g. what are `k` and `topN`? ########## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ########## @@ -350,4 +354,38 @@ private static TopDocs mergeAux( return new TopFieldDocs(totalHits, hits, sort.getSort()); } } + + /** Reciprocal Rank Fusion method. */ + public static TopDocs rrf(int TopN, int k, TopDocs[] hits) { + Map<Integer, Float> rrfScore = new HashMap<>(); + long minHits = Long.MAX_VALUE; + for (TopDocs topDoc : hits) { + minHits = Math.min(minHits, topDoc.totalHits.value); Review Comment: I wonder if it should be a `max` rather than a `min`? Presumably, hits from either top hits are considered as hits globally, and we are just using RRF to boost hits that are ranked high in multiple top hits? ########## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ########## @@ -350,4 +354,38 @@ private static TopDocs mergeAux( return new TopFieldDocs(totalHits, hits, sort.getSort()); } } + + /** Reciprocal Rank Fusion method. */ + public static TopDocs rrf(int TopN, int k, TopDocs[] hits) { + Map<Integer, Float> rrfScore = new HashMap<>(); Review Comment: Note that we should identify documents not only by their doc ID, but by the combination of `ScoreDoc.shardIndex` and `ScoreDoc.doc`, as there could be multiple documents that have the same doc ID but come from different shards. ########## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ########## @@ -350,4 +354,38 @@ private static TopDocs mergeAux( return new TopFieldDocs(totalHits, hits, sort.getSort()); } } + + /** Reciprocal Rank Fusion method. */ + public static TopDocs rrf(int TopN, int k, TopDocs[] hits) { Review Comment: nit: function arguments should use lower camel case ```suggestion public static TopDocs rrf(int topN, int k, TopDocs[] hits) { ``` ########## lucene/core/src/java/org/apache/lucene/search/TopDocs.java: ########## @@ -350,4 +354,38 @@ private static TopDocs mergeAux( return new TopFieldDocs(totalHits, hits, sort.getSort()); } } + + /** Reciprocal Rank Fusion method. */ + public static TopDocs rrf(int TopN, int k, TopDocs[] hits) { + Map<Integer, Float> rrfScore = new HashMap<>(); + long minHits = Long.MAX_VALUE; + for (TopDocs topDoc : hits) { + minHits = Math.min(minHits, topDoc.totalHits.value); + Map<Integer, Float> scoreMap = new HashMap<>(); + for (ScoreDoc scoreDoc : topDoc.scoreDocs) { + scoreMap.put(scoreDoc.doc, scoreDoc.score); + } + + List<Map.Entry<Integer, Float>> scoreList = new ArrayList<>(scoreMap.entrySet()); + scoreList.sort(Map.Entry.comparingByValue()); + + int rank = 1; + for (ScoreDoc scoreDoc : topDoc.scoreDocs) { + rrfScore.put(scoreDoc.doc, rrfScore.getOrDefault(scoreDoc.doc, 0.0f) + 1.0f / (rank + k)); Review Comment: Use `Map#compute` instead of `getOrDefault` + `put`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org