[ https://issues.apache.org/jira/browse/LUCENE-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12695853#action_12695853 ]
Shai Erera commented on LUCENE-1575: ------------------------------------ Mike - I've been trying to see this performance hit, without indexing the full wikipedia content, but failed. What I did is: * Create an index with 10M documents with one term (all with the same term). The index had 51 segments (I deliberately aimed at a high number). * Run a sort-by-doc (in reverse order so that collect() will actually do some work) search using TermQuery (so that TermScorer will be used). * Measure the avg. time of 20 query executions (I deliberately omit the first execution because I've noticed that the first run sometimes take much longer, even after I kill the JVM. I guess it has to do with OS caches). The time I measure between the trunk and the current version are very much comparable: trunk-patched: (1) 1248.45 ms (2) 1255.45 ms trunk-orig: (1) 1314.1 ms (2) 1265.65 ms In both runs the patched trunk ran faster. Both did not have any large differences. Here's the code: {code} public static void main(String[] args) throws Exception { File indexDir = new File(args[0]); Directory dir = new FSDirectory(indexDir, new NativeFSLockFactory(indexDir)); if (!indexDir.exists() || indexDir.list().length == 0) { int numDocs = 10000000; IndexWriter writer = new IndexWriter(dir, null, MaxFieldLength.UNLIMITED); writer.setMergeFactor(50); writer.setMaxBufferedDocs(numDocs/100); System.out.println("populating index"); long time = System.currentTimeMillis(); for (int i = 0; i < numDocs; i++) { Document doc = new Document(); doc.add(new Field("c", "test", Store.NO, Index.NOT_ANALYZED)); writer.addDocument(doc); } writer.close(false); System.out.println("time=" + (System.currentTimeMillis() - time)); } System.out.println("searching"); IndexSearcher searcher = new IndexSearcher(dir); System.out.println("numSegments=" + searcher.getIndexReader().getSequentialSubReaders().length); Sort sort = new Sort(new SortField(null, SortField.DOC, true)); searcher.search(new TermQuery(new Term("c", "test")), null, 10, sort); long time = System.currentTimeMillis(); double numQueries = 20; for (int i = 0; i < numQueries; i++) { searcher.search(new TermQuery(new Term("c", "test")), null, 10, sort); } time = System.currentTimeMillis() - time; System.out.println("avg. time=" + (time / numQueries) + " ms"); searcher.close(); } {code} Can you try running this code on your machine? > Refactoring Lucene collectors (HitCollector and extensions) > ----------------------------------------------------------- > > Key: LUCENE-1575 > URL: https://issues.apache.org/jira/browse/LUCENE-1575 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Reporter: Shai Erera > Assignee: Michael McCandless > Fix For: 2.9 > > Attachments: LUCENE-1575.1.patch, LUCENE-1575.2.patch, > LUCENE-1575.3.patch, LUCENE-1575.4.patch, LUCENE-1575.5.patch, > LUCENE-1575.6.patch, LUCENE-1575.7.patch, LUCENE-1575.patch, > LUCENE-1575.patch, LUCENE-1575.patch, sortBench5.py, sortCollate5.py > > > This issue is a result of a recent discussion we've had on the mailing list. > You can read the thread > [here|http://www.nabble.com/Is-TopDocCollector%27s-collect()-implementation-correct--td22557419.html]. > We have agreed to do the following refactoring: > * Rename MultiReaderHitCollector to Collector, with the purpose that it will > be the base class for all Collector implementations. > * Deprecate HitCollector in favor of the new Collector. > * Introduce new methods in IndexSearcher that accept Collector, and deprecate > those that accept HitCollector. > ** Create a final class HitCollectorWrapper, and use it in the deprecated > methods in IndexSearcher, wrapping the given HitCollector. > ** HitCollectorWrapper will be marked deprecated, so we can remove it in 3.0, > when we remove HitCollector. > ** It will remove any instanceof checks that currently exist in IndexSearcher > code. > * Create a new (abstract) TopDocsCollector, which will: > ** Leave collect and setNextReader unimplemented. > ** Introduce protected members PriorityQueue and totalHits. > ** Introduce a single protected constructor which accepts a PriorityQueue. > ** Implement topDocs() and getTotalHits() using the PQ and totalHits members. > These can be used as-are by extending classes, as well as be overridden. > ** Introduce a new topDocs(start, howMany) method which will be used a > convenience method when implementing a search application which allows paging > through search results. It will also attempt to improve the memory > allocation, by allocating a ScoreDoc[] of the requested size only. > * Change TopScoreDocCollector to extend TopDocsCollector, use the topDocs() > and getTotalHits() implementations as they are from TopDocsCollector. The > class will also be made final. > * Change TopFieldCollector to extend TopDocsCollector, and make the class > final. Also implement topDocs(start, howMany). > * Change TopFieldDocCollector (deprecated) to extend TopDocsCollector, > instead of TopScoreDocCollector. Implement topDocs(start, howMany) > * Review other places where HitCollector is used, such as in Scorer, > deprecate those places and use Collector instead. > Additionally, the following proposal was made w.r.t. decoupling score from > collect(): > * Change collect to accecpt only a doc Id (unbased). > * Introduce a setScorer(Scorer) method. > * If during collect the implementation needs the score, it can call > scorer.score(). > If we do this, then we need to review all places in the code where > collect(doc, score) is called, and assert whether Scorer can be passed. Also > this raises few questions: > * What if during collect() Scorer is null? (i.e., not set) - is it even > possible? > * I noticed that many (if not all) of the collect() implementations discard > the document if its score is not greater than 0. Doesn't it mean that score > is needed in collect() always? > Open issues: > * The name for Collector > * TopDocsCollector was mentioned on the thread as TopResultsCollector, but > that was when we thought to call Colletor ResultsColletor. Since we decided > (so far) on Collector, I think TopDocsCollector makes sense, because of its > TopDocs output. > * Decoupling score from collect(). > I will post a patch a bit later, as this is expected to be a very large > patch. I will split it into 2: (1) code patch (2) test cases (moving to use > Collector instead of HitCollector, as well as testing the new topDocs(start, > howMany) method. > There might be even a 3rd patch which handles the setScorer thing in > Collector (maybe even a different issue?) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org