[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2097: LUCENE-9537

GitBox Sun, 29 Nov 2020 07:19:20 -0800


mikemccand commented on a change in pull request #2097:
URL: https://github.com/apache/lucene-solr/pull/2097#discussion_r532220806




##########
File path: lucene/core/src/java/org/apache/lucene/search/CachingCollector.java
##########
@@ -59,6 +59,11 @@
     @Override
     public final float score() { return score; }
 
+    @Override
+    public float smoothingScore(int docId) throws IOException {
+      return 0;

Review comment:
       Hmm, should we also cache the `smoothingScore` for this hit?
   
   Or, if we will keep it at returning `0`, couldn't we remove this impl and 
inherit the default from `Scorable`?

##########
File path: lucene/core/src/java/org/apache/lucene/search/IndriAndScorer.java
##########
@@ -0,0 +1,59 @@
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.List;
+
+/** Combines scores of subscorers. If a subscorer does not contain
+ *  the docId, a smoothing score is calculated for that 
+ *  document/subscorer combination.
+ */
+public class IndriAndScorer extends IndriDisjunctionScorer {

Review comment:
       Do we expect to have other scorers that need to extend the 
`IndriDisjunctionScorer`?  Here we have only one such subclass, so I wonder if 
we could just fold all of this into a single `IndriDisjunctionScorer` class?  
Or, are you laying bedrock for future additional query types?

##########
File path: lucene/core/src/java/org/apache/lucene/search/FilterScorable.java
##########
@@ -46,6 +46,11 @@ public float score() throws IOException {
     return in.score();
   }
 
+  @Override
+  public float smoothingScore(int docId) throws IOException {
+    return 0;

Review comment:
       Hmm should this be `return in.smoothingScore(docId)` instead?

##########
File path: lucene/core/src/java/org/apache/lucene/search/IndriAndQuery.java
##########
@@ -0,0 +1,22 @@
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.List;
+
+/** A Query that matches documents matching combinations of 
+ * {@link TermQuery}s or other IndriAndQuerys.
+ */
+public class IndriAndQuery extends IndriQuery {
+  
+  public IndriAndQuery(List<BooleanClause> clauses) {

Review comment:
       In general, `BooleanClause` can hold any Lucene `Query` implementation, 
but it looks like we are only supporting `TermQuery` and other `IndriAndQuery` 
clauses (just from reading the javadoc)?  If so, should we check/enforce this?

##########
File path: lucene/core/src/java/org/apache/lucene/search/IndriAndQuery.java
##########
@@ -0,0 +1,22 @@
+package org.apache.lucene.search;

Review comment:
       Could you add the standard Apache copyright header here and in all the 
new classes?

##########
File path: lucene/core/src/java/org/apache/lucene/search/IndriAndScorer.java
##########
@@ -0,0 +1,59 @@
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.List;
+
+/** Combines scores of subscorers. If a subscorer does not contain
+ *  the docId, a smoothing score is calculated for that 
+ *  document/subscorer combination.
+ */
+public class IndriAndScorer extends IndriDisjunctionScorer {
+  
+  protected IndriAndScorer(Weight weight, List<Scorer> subScorers,
+      ScoreMode scoreMode, float boost) throws IOException {
+    super(weight, subScorers, scoreMode, boost);
+  }
+  
+  @Override
+  public float score(List<Scorer> subScorers) throws IOException {
+    int docId = this.docID();
+    return scoreDoc(subScorers, docId);
+  }
+  
+  @Override
+  public float smoothingScore(List<Scorer> subScorers, int docId)
+      throws IOException {
+    return scoreDoc(subScorers, docId);
+  }
+  
+  private float scoreDoc(List<Scorer> subScorers, int docId)
+      throws IOException {
+    double score = 0;
+    double boostSum = 0.0;
+    for (Scorer scorer : subScorers) {
+      if (scorer instanceof IndriScorer) {
+        IndriScorer indriScorer = (IndriScorer) scorer;
+        int scorerDocId = indriScorer.docID();
+        //If the query exists in the document, score the document
+        //Otherwise, compute a smoothing score, which acts like an idf
+        //for subqueries/terms
+        if (docId == scorerDocId) {

Review comment:
       Can you factor out the 2nd two statements under each of the `true` and 
`false` clauses here?  I.e., only the first line needs to be conditional?

##########
File path: 
lucene/core/src/java/org/apache/lucene/search/IndriDisjunctionScorer.java
##########
@@ -0,0 +1,65 @@
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.List;
+
+/**
+ * The Indri implemenation of a disjunction scorer which stores the subscorers
+ * for for the child queries. The score and smoothingScore methods use the list

Review comment:
       Remove extra `for`?

##########
File path: lucene/core/src/java/org/apache/lucene/search/IndriScorer.java
##########
@@ -0,0 +1,37 @@
+package org.apache.lucene.search;
+
+import java.io.IOException;
+
+/**
+ * The Indri parent scorer that stores the boost so that 
+ * IndriScorers can use the boost outside of the term.
+ *
+ */
+abstract public class IndriScorer extends Scorer {
+
+       private float boost;

Review comment:
       Hmm is this (to store `boost`) the only reason to have a separate 
`IndriScorer`?  If I remember right, Lucene used to apply `boost` similarly 
(every scorer kept track of it) but at one point we moved all boosting to a 
dedicated `BoostQuery`.

##########
File path: lucene/core/src/java/org/apache/lucene/search/Scorable.java
##########
@@ -30,6 +30,13 @@
    * Returns the score of the current document matching the query.
    */
   public abstract float score() throws IOException;
+  
+  /**
+   * Returns the smoothing score of the current document matching the query.
+   * This score is used when the query/term does not appear in the document.
+   * This can return 0 or a smoothing score.

Review comment:
       Maybe link to Indri paper that describes/motivates this?

##########
File path: lucene/core/src/java/org/apache/lucene/search/IndriAndQuery.java
##########
@@ -0,0 +1,22 @@
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.List;
+
+/** A Query that matches documents matching combinations of 
+ * {@link TermQuery}s or other IndriAndQuerys.
+ */
+public class IndriAndQuery extends IndriQuery {

Review comment:
       Normally in Lucene `AND` implies `MUST`, i.e. required clauses.
   
   But this query is actually disjunctive, right?  Documents will match even if 
they are missing some of the terms.  Should we name it `IndriOrQuery` maybe?  
Or, `IndriBooleanQuery`?

##########
File path: 
lucene/core/src/java/org/apache/lucene/search/similarities/IndriDirichletSimilarity.java
##########
@@ -0,0 +1,108 @@
+/*
+ * 
===============================================================================================
+ * Copyright (c) 2019 Carnegie Mellon University and University of 
Massachusetts. All Rights
+ * Reserved.
+ *
+ * Use of the Lemur Toolkit for Language Modeling and Information Retrieval is 
subject to the terms
+ * of the software license set forth in the LICENSE file included with this 
software, and also
+ * available at http://www.lemurproject.org/license.html
+ *
+ * 
================================================================================================
+ */
+package org.apache.lucene.search.similarities;
+
+import java.util.List;
+import java.util.Locale;
+
+import org.apache.lucene.search.Explanation;
+import org.apache.lucene.search.similarities.BasicStats;
+import org.apache.lucene.search.similarities.LMSimilarity;
+
+/**
+ * Bayesian smoothing using Dirichlet priors as implemented in the Indri Search
+ * engine (http://www.lemurproject.org/indri.php). Indri Dirichelet Smoothing!
+ * tf_E + mu*P(t|D) P(t|E)= ------------------------ documentLength + 
documentMu
+ * mu*P(t|C) + tf_D where P(t|D)= --------------------- doclen + mu
+ */

Review comment:
       Maybe, add a few words giving some intuition about what `mu` does?  It 
looks like it roughly lets you tune how important document length is in the 
scoring?
   
   Also, the formatting of the above equations looks like it got garbled?  You 
will need to use html/javadoc markup to make the formatting survive future 
developers viewing in browser...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2097: LUCENE-9537

Reply via email to