[ 
https://issues.apache.org/jira/browse/LUCENE-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655616#action_12655616
 ] 

Michael McCandless commented on LUCENE-1483:
--------------------------------------------


OK, I ran a quick perf test on a 100 segment index with 1 million docs
(10K docs per segment), for a single TermQuery ("text"), and I'm
seeing an 11.1% speedup (best of 4: 20.36s -> 18.11s) with this patch,
on Mac OS X.  On Linux I see a 6.3% speedup (best of 4: 23.31s ->
21.84s).
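
For reference, the percentages above are relative reductions in
wall-clock time. A minimal sketch of that arithmetic follows; the class
and method names are hypothetical, chosen only for illustration:

```java
import java.util.Locale;

public class SpeedupDemo {
    // Relative reduction in run time, as a percentage of the baseline.
    public static double speedupPercent(double beforeSec, double afterSec) {
        return (beforeSec - afterSec) / beforeSec * 100.0;
    }

    public static void main(String[] args) {
        // Best-of-4 timings quoted above.
        System.out.println(String.format(Locale.ROOT, "OS X:  %.1f%%", speedupPercent(20.36, 18.11)));
        System.out.println(String.format(Locale.ROOT, "Linux: %.1f%%", speedupPercent(23.31, 21.84)));
    }
}
```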

Single segment index shows no difference, as expected.

I think the speedup is due to avoiding the extra method call plus 2nd
pass through the int docs[] to add in the doc base, in
MultiSegmentReader.MultiTermDocs.read(int[] docs, int[] freqs).
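
To make the suspected overhead concrete, here is a rough sketch of the
"delegate, then rebase in a second pass" pattern. This is not the
actual MultiSegmentReader code; DocBaseDemo, readSegment and
readThenRebase are hypothetical names, and the per-segment read is
faked with an array copy:

```java
import java.util.Arrays;

public class DocBaseDemo {
    // Stand-in for a per-segment read: copies segment-local doc ids into docs[].
    public static int readSegment(int[] segmentDocs, int[] docs) {
        int n = Math.min(segmentDocs.length, docs.length);
        System.arraycopy(segmentDocs, 0, docs, 0, n);
        return n;
    }

    // Multi-segment style: one extra method call to the segment, then a
    // second pass over docs[] to turn local ids into index-wide ids.
    public static int readThenRebase(int[] segmentDocs, int docBase, int[] docs) {
        int n = readSegment(segmentDocs, docs);
        for (int i = 0; i < n; i++) {
            docs[i] += docBase;
        }
        return n;
    }

    public static void main(String[] args) {
        int[] segmentDocs = {0, 3, 7};
        int[] docs = new int[3];
        int n = readThenRebase(segmentDocs, 10000, docs);
        System.out.println(n + " " + Arrays.toString(docs));
    }
}
```

Searching each segment directly and applying the doc base only when a
hit is collected skips both the delegation and the second loop.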

This is a nice "side effect", i.e., in addition to getting faster reopen
performance (the original goal here), we get a bump in single term
search performance.

I think given this, we should cut over other search methods
(sort-by-relevance, custom HitCollector) to this approach?  Maybe if
we add a new Scorer.score method that can accept a "docBase" which it
adds into the doc() before calling collect()?  In fact, if we do that,
we may not even need the new MultiReaderTopFieldDocCollector at all?

Hmm, though, a Scorer may override that score(HitCollector), e.g.,
BooleanScorer does.  Maybe we have to make a wrapper HitCollector that
simply adds the docBase to each doc and then invokes the real
HitCollector.collect with the shifted doc?  Though that costs us
an extra method call per collect().
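
A minimal sketch of that wrapper idea, to show where the extra call
comes from. It does not use the real Lucene classes: SimpleCollector is
a hypothetical stand-in for HitCollector, and DocBaseCollector,
setDocBase and collectGlobalDocs are made-up names for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class DocBaseCollectorDemo {
    /** Hypothetical minimal stand-in for a HitCollector-style callback. */
    public interface SimpleCollector {
        void collect(int doc, float score);
    }

    /** Wrapper that rebases segment-local doc ids before delegating. */
    public static final class DocBaseCollector implements SimpleCollector {
        private final SimpleCollector delegate;
        private int docBase;

        public DocBaseCollector(SimpleCollector delegate) {
            this.delegate = delegate;
        }

        // Called once per segment, before that segment's hits are collected.
        public void setDocBase(int docBase) {
            this.docBase = docBase;
        }

        @Override
        public void collect(int doc, float score) {
            // This delegation is the extra method call per hit.
            delegate.collect(doc + docBase, score);
        }
    }

    /** Feeds per-segment hits through the wrapper and returns the global doc ids. */
    public static List<Integer> collectGlobalDocs(int[][] segmentHits, int docsPerSegment) {
        final List<Integer> globalDocs = new ArrayList<Integer>();
        SimpleCollector real = new SimpleCollector() {
            @Override
            public void collect(int doc, float score) {
                globalDocs.add(doc);
            }
        };
        DocBaseCollector wrapper = new DocBaseCollector(real);
        for (int seg = 0; seg < segmentHits.length; seg++) {
            // Assumes equal-sized segments, so the base is seg * docsPerSegment.
            wrapper.setDocBase(seg * docsPerSegment);
            for (int localDoc : segmentHits[seg]) {
                wrapper.collect(localDoc, 1.0f);
            }
        }
        return globalDocs;
    }

    public static void main(String[] args) {
        int[][] segmentHits = { {0, 5}, {2}, {9999} };
        System.out.println(collectGlobalDocs(segmentHits, 10000));
    }
}
```

The real collector never sees segment-local ids, so existing
HitCollector implementations would not need to change; the price is the
one delegating call per hit.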

Here's the alg I used (slightly modified from the one above):
{code}
merge.factor=1000
compound=false

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
directory=FSDirectory
#directory=RamDirectory

doc.tokenized=true
doc.term.vector=false
doc.add.log.step=100000
max.buffered=10000
ram.flush.mb=1000

work.dir = /lucene/work

doc.maker=org.apache.lucene.benchmark.byTask.feeds.SortableSimpleDocMaker

query.maker=org.apache.lucene.benchmark.byTask.feeds.FileBasedQueryMaker
file.query.maker.file = test.queries

task.max.depth.log=2

log.queries=true

{ "Populate"
  -CreateIndex
  { "MAddDocs" AddDoc(100) > : 1000000
  -CloseIndex
}
    
{ "Rounds"
  { "Run"
    { "TestSortSpeed"
      OpenReader  
      { "LoadFieldCacheAndSearch" SearchWithSort(sort_field:int) > : 1 
      { "SearchWithSort" SearchWithSort(sort_field) > : 500
      CloseReader 
    }
    NewRound
  } : 4
} 

RepSumByPrefRound SearchWithSort
{code}

It creates the index once, then does 4 rounds of searching with the
single query "text" in test.queries (SimpleQueryMaker was creating
other queries that were getting 0 or 1 hits).

I'm running with "java -Xms1024M -Xmx1024M -Xbatch -server"; java is
1.6.0_07 on a Mac Pro running OS X 10.5.5, and 1.6.0_10-rc on a
2.6.22.1 Linux kernel.


> Change IndexSearcher to use MultiSearcher semantics for sorted searches
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-1483
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1483
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 2.9
>            Reporter: Mark Miller
>            Priority: Minor
>         Attachments: LUCENE-1483.patch, LUCENE-1483.patch, LUCENE-1483.patch, 
> LUCENE-1483.patch
>
>
> Here is a quick test patch. FieldCache for sorting is done at the individual 
> IndexReader level and reloading the fieldcache on reopen can be much faster 
> as only changed segments need to be reloaded.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

