[GitHub] mikemccand commented on a change in pull request #579: PET: prorated early termination

2019-02-19 Thread GitBox
mikemccand commented on a change in pull request #579: PET: prorated early 
termination
URL: https://github.com/apache/lucene-solr/pull/579#discussion_r258156667
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java
 ##
 @@ -443,16 +443,10 @@ public void search(Query query, Collector results)
 search(leafContexts, createWeight(query, results.scoreMode(), 1), results);
   }
 
-  /** Search implementation with arbitrary sorting, plus
-   * control over whether hit scores and max score
-   * should be computed.  Finds
-   * the top n hits for query, and sorting
-   * the hits by the criteria in sort.
-   * If doDocScores is true
-   * then the score of each hit will be computed and
-   * returned.  If doMaxScore is
-   * true then the maximum score over all
-   * collected hits will be computed.
+  /** Search implementation with arbitrary sorting, plus control over whether 
hit scores should be
+   * computed.  Finds the top n hits for query, and 
sorting the hits by
+   * the criteria in sort.  If doDocScores is 
true then the
+   * score of each hit will be computed and returned.
 
 Review comment:
   Thanks for cleaning up our stale javadocs!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] mikemccand commented on a change in pull request #579: PET: prorated early termination

2019-02-19 Thread GitBox
mikemccand commented on a change in pull request #579: PET: prorated early 
termination
URL: https://github.com/apache/lucene-solr/pull/579#discussion_r258162824
 
 

 ##
 File path: 
lucene/core/src/test/org/apache/lucene/search/TestTopFieldCollectorEarlyTermination.java
 ##
 @@ -234,13 +242,189 @@ public void testCanEarlyTerminateOnPrefix() {
 new Sort(new SortField("c", SortField.Type.LONG), new SortField("b", 
SortField.Type.STRING;
   }
 
+  private Document numberedDocument(int num, int iterm) {
+final Document doc = new Document();
+doc.add(new NumericDocValuesField("ndv1", num));
+doc.add(new StringField("s", terms.get(iterm % terms.size()), Store.YES));
+return doc;
+  }
+
+  private void createRandomTerms() {
+final int numTerms = TestUtil.nextInt(random(), 1, numDocs / 5);
+Set randomTerms = new HashSet<>();
+while (randomTerms.size() < numTerms) {
+  randomTerms.add(TestUtil.randomSimpleString(random()));
+}
+terms = new ArrayList<>(randomTerms);
+  }
+
+  private void createUniformIndexX() throws IOException {
+dir = newDirectory();
+numDocs = atLeast(150);
+int numSegs = atLeast(5);
+// Create segments of random pre-determined sizes so we can distribute the 
documents uniformly
+// among them
+int docsRemaining = numDocs;
+List segmentSizes = new ArrayList<>();
+for (int i = 0; i < numSegs - 1; i++) {
+  int size = random().nextInt(docsRemaining - numSegs + i);
+  segmentSizes.add(size);
+  docsRemaining -= size;
+}
+segmentSizes.add(docsRemaining);
+List> segDocs = new ArrayList<>();
+for (int i = 0; i < numSegs; i++) {
+  segDocs.add(new ArrayList<>());
+}
+createRandomTerms();
+List> segDocsToFill = new ArrayList<>(segDocs);
+for (int seg = 0, i = 0, j = 0; i < numDocs; ++i) {
+  // Create documents with the sort key and terms uniformly distributed 
among segments
+  seg %= segDocsToFill.size();
+  if (seg == 0) {
+// this causes equal numbers of docs with "score" j to be added to 
each segment that has at least j documents
+// TODO: sometimes do not increment j (so we get more random setup), 
but we must increment it when complete a segment
+++j;
+  }
+  List docs = segDocsToFill.get(seg);
+  docs.add(numberedDocument(j, j));
+  if (docs.size() == segmentSizes.get(seg)) {
+segmentSizes.remove(seg);
+segDocsToFill.remove(seg);
+  } else {
+++seg;
+  }
+}
+final long seed = random().nextLong();
+final IndexWriterConfig iwc = newIndexWriterConfig(new MockAnalyzer(new 
Random(seed)));
+// one segment per commit so we can control the segment sizes
+iwc.setMergePolicy(NoMergePolicy.INSTANCE);
+iwc.setIndexSort(sort);
+iw = new RandomIndexWriter(new Random(seed), dir, iwc);
+for (int seg = 0; seg < segDocs.size(); seg++) {
+  for (Document doc : segDocs.get(seg)) {
+iw.addDocument(doc);
+  }
+  iw.commit();
+}
+reader = iw.getReader();
+  }
+
+  private void createUniformIndex() throws IOException {
+dir = newDirectory();
+numDocs = atLeast(150);
+int numSegs = atLeast(5);
+// Create segments of random pre-determined sizes so we can distribute the 
documents uniformly
+// among them
+int docsRemaining = numDocs;
+List segmentSizes = new ArrayList<>();
+for (int i = 0; i < numSegs - 1; i++) {
+  int size = random().nextInt(docsRemaining - numSegs + i);
+  segmentSizes.add(size);
+  docsRemaining -= size;
+}
+segmentSizes.add(docsRemaining);
+createRandomTerms();
+final long seed = random().nextLong();
+final IndexWriterConfig iwc = newIndexWriterConfig(new MockAnalyzer(new 
Random(seed)));
+// one segment per commit so we can control the segment sizes
+iwc.setMergePolicy(NoMergePolicy.INSTANCE);
+iwc.setMaxBufferedDocs(numDocs);
+iwc.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
+iwc.setIndexSort(sort);
+writer = new IndexWriter(dir, iwc);
+for (int seg = 0; seg < numSegs; seg++) {
+  int size = segmentSizes.get(seg);
+  double step = numDocs / (double) size;
+  for (int i = 0; i < size; i++) {
+int num = (int) Math.round(i * step);
+Document doc = numberedDocument(num, num);
+writer.addDocument(doc);
+  }
+  writer.commit();
+}
+reader = DirectoryReader.open(writer);
+  }  
+
+  private void createSkewedIndex() throws IOException {
+dir = newDirectory();
+numDocs = atLeast(150);
+createRandomTerms();
+final long seed = random().nextLong();
+final IndexWriterConfig iwc = newIndexWriterConfig(new MockAnalyzer(new 
Random(seed)));
+// one segment per commit so we can control the segment sizes
+iwc.setMergePolicy(NoMergePolicy.INSTANCE);
+iwc.setIndexSort(sort);
+writer = new IndexWriter(dir, iwc);
+for (int i = 0; i < 

[GitHub] mikemccand commented on a change in pull request #579: PET: prorated early termination

2019-02-19 Thread GitBox
mikemccand commented on a change in pull request #579: PET: prorated early 
termination
URL: https://github.com/apache/lucene-solr/pull/579#discussion_r258160398
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java
 ##
 @@ -165,11 +169,35 @@ public void collect(int doc) throws IOException {
   updateMinCompetitiveScore(scorer);
 }
   }
+  if (canEarlyTerminate) {
+  // When early terminating, stop collecting hits from this leaf 
once we have its prorated hits.
+  if (leafHits > leafHitsThreshold) {
+  totalHitsRelation = Relation.GREATER_THAN_OR_EQUAL_TO;
+  throw new CollectionTerminatedException();
+  }
+  }
 }
 
   };
 }
 
+/** The total number of documents that matched this query; may be a lower 
bound in case of early termination. */
+@Override
+public int getTotalHits() {
+  return totalHits;
+}
+
+private int prorateForSegment(int topK, LeafReaderContext leafCtx) {
+// prorate number of hits to collect based on proportion of documents 
in this leaf (segment).
+// p := probability of a top-k document (or any document) being in 
this segment
+double p = (double) leafCtx.reader().numDocs() / 
leafCtx.parent.reader().numDocs();
+// m := expected number of the topK results in this segment
+double m = p * topK;
+// Increase N to include a bound to ensure the probability of missing 
a doc is very small
 
 Review comment:
   Maybe add a short description describing the math / distribution here, 
explaining the equation below?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] mikemccand commented on a change in pull request #579: PET: prorated early termination

2019-02-19 Thread GitBox
mikemccand commented on a change in pull request #579: PET: prorated early 
termination
URL: https://github.com/apache/lucene-solr/pull/579#discussion_r258158795
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java
 ##
 @@ -100,8 +101,8 @@ private static boolean canEarlyTerminateOnPrefix(Sort 
searchSort, Sort indexSort
 final Sort sort;
 final FieldValueHitQueue queue;
 
-public SimpleFieldCollector(Sort sort, FieldValueHitQueue queue, 
int numHits, int totalHitsThreshold) {
-  super(queue, numHits, totalHitsThreshold, sort.needsScores());
+public SimpleFieldCollector(Sort sort, FieldValueHitQueue queue, 
int numHits, int totalHitsThreshold, int perSegmentMargin) {
 
 Review comment:
   Maybe add javadocs here linking to the nice docs in the `create` methods 
below?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] mikemccand commented on a change in pull request #579: PET: prorated early termination

2019-02-19 Thread GitBox
mikemccand commented on a change in pull request #579: PET: prorated early 
termination
URL: https://github.com/apache/lucene-solr/pull/579#discussion_r258158151
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java
 ##
 @@ -525,25 +517,8 @@ private TopFieldDocs searchAfter(FieldDoc after, Query 
query, int numHits, Sort
 final int cappedNumHits = Math.min(numHits, limit);
 final Sort rewrittenSort = sort.rewrite(this);
 
-final CollectorManager manager = new 
CollectorManager() {
-
-  @Override
-  public TopFieldCollector newCollector() throws IOException {
-// TODO: don't pay the price for accurate hit counts by default
-return TopFieldCollector.create(rewrittenSort, cappedNumHits, after, 
TOTAL_HITS_THRESHOLD);
-  }
-
-  @Override
-  public TopFieldDocs reduce(Collection collectors) 
throws IOException {
-final TopFieldDocs[] topDocs = new TopFieldDocs[collectors.size()];
-int i = 0;
-for (TopFieldCollector collector : collectors) {
-  topDocs[i++] = collector.topDocs();
-}
-return TopDocs.merge(rewrittenSort, 0, cappedNumHits, topDocs, true);
-  }
-
-};
+final CollectorManager manager =
+TopFieldCollector.createManager(rewrittenSort, cappedNumHits, after, 
TOTAL_HITS_THRESHOLD, Integer.MAX_VALUE);
 
 Review comment:
   Was this just a straight refactor?  The new way does the same thing as the 
old anonymous class?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[GitHub] mikemccand commented on a change in pull request #579: PET: prorated early termination

2019-02-19 Thread GitBox
mikemccand commented on a change in pull request #579: PET: prorated early 
termination
URL: https://github.com/apache/lucene-solr/pull/579#discussion_r258163341
 
 

 ##
 File path: lucene/core/src/java/org/apache/lucene/search/TopFieldCollector.java
 ##
 @@ -165,11 +169,35 @@ public void collect(int doc) throws IOException {
   updateMinCompetitiveScore(scorer);
 }
   }
+  if (canEarlyTerminate) {
+  // When early terminating, stop collecting hits from this leaf 
once we have its prorated hits.
+  if (leafHits > leafHitsThreshold) {
+  totalHitsRelation = Relation.GREATER_THAN_OR_EQUAL_TO;
+  throw new CollectionTerminatedException();
+  }
+  }
 }
 
   };
 }
 
+/** The total number of documents that matched this query; may be a lower 
bound in case of early termination. */
+@Override
+public int getTotalHits() {
+  return totalHits;
+}
+
+private int prorateForSegment(int topK, LeafReaderContext leafCtx) {
+// prorate number of hits to collect based on proportion of documents 
in this leaf (segment).
+// p := probability of a top-k document (or any document) being in 
this segment
+double p = (double) leafCtx.reader().numDocs() / 
leafCtx.parent.reader().numDocs();
 
 Review comment:
   Do we need to guard for divide-by-zero here?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org