[ http://issues.apache.org/jira/browse/LUCENE-687?page=all ]

Yonik Seeley resolved LUCENE-687.
---------------------------------

    Fix Version/s: 2.1
       Resolution: Fixed
         Assignee: Yonik Seeley

Reviewed and committed.  Thanks Michael!

> Performance improvement: Lazy skipping on proximity file
> --------------------------------------------------------
>
>                 Key: LUCENE-687
>                 URL: http://issues.apache.org/jira/browse/LUCENE-687
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>         Assigned To: Yonik Seeley
>            Priority: Minor
>             Fix For: 2.1
>
>         Attachments: lazy_prox_skipping.patch
>
>
> Hello,
> I'm proposing a patch that changes
> org.apache.lucene.index.SegmentTermPositions to avoid unnecessary skips and
> reads on the proximity stream. Currently, a call to next() or seek() that
> moves to a document in the freq file also moves the prox pointer to the
> posting list of that document. But this is only necessary if positions
> actually have to be retrieved for that particular document.
> Consider, for example, a phrase query with two terms: the freq pointer for
> term 1 has to move to document x to determine whether the term occurs in that
> document. But the positions have to be read *only* if term 2 also matches
> document x, to figure out whether term 1 and term 2 appear next to each other
> in document x and thus satisfy the query.
> A move to the posting list of a document can be quite expensive. The prox
> pointer has to skip to the last skip point before that document, and then the
> documents between the skip point and the desired document have to be scanned,
> which means that the VInts of all positions of those documents have to be
> read and decoded.
> An improvement is to move the prox pointer to a document lazily, only when
> nextPosition() is called. This will become even more important in the future
> when the size of the proximity file increases (e.g. by adding payloads to
> the posting lists).
> My patch implements this lazy skipping. All unit tests pass. 
> I also attach a new unit test that works as follows:
> An index is created in a RAMDirectory and test docs are added. Then the
> index is optimized to make sure it has only a single segment. This is
> important because IndexReader.open() returns an instance of SegmentReader if
> there is only one segment in the index. The proxStream field of
> SegmentReader is package protected, so the test can set proxStream to a
> different object. I use a class called SeeksCountingStream that extends
> IndexInput and counts the number of invocations of seek().
> Then the testcase searches the index using a PhraseQuery "term1 term2". The
> number of documents that match that query is known, so the testcase can
> verify that seek() on the proxStream is not called more often than the number
> of search hits.
> Example:
> Number of docs in the index: 500
> Number of docs that match the query "term1 term2": 5
> Invocations of seek on prox stream (old code): 29
> Invocations of seek on prox stream (patched version): 5
> - Michael
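
For readers who want to see the idea in isolation, here is a minimal Java
sketch of lazy versus eager prox positioning. It is not the code from
lazy_prox_skipping.patch and does not use the Lucene API: the classes
SeekCountingStream and TermCursor, the run() helper, and the doc ids are all
hypothetical stand-ins. The sketch also simplifies by treating every document
that contains both terms as one where positions are read, and by counting
seeks on both terms' streams.

```java
class LazyProxSkippingSketch {

    /** Hypothetical stand-in for the proximity stream; it only counts seek() calls. */
    static final class SeekCountingStream {
        long pointer;
        int seekCount;

        void seek(long position) {
            pointer = position;
            seekCount++;
        }
    }

    /** Hypothetical cursor over one term's postings (sorted doc ids). */
    static final class TermCursor {
        private final int[] docs;
        private final SeekCountingStream prox;
        private final boolean lazy;
        private int index = -1;
        private boolean proxPending; // true while the prox seek is deferred

        TermCursor(int[] docs, SeekCountingStream prox, boolean lazy) {
            this.docs = docs;
            this.prox = prox;
            this.lazy = lazy;
        }

        /** Current doc id; only valid after next() has returned true. */
        int doc() {
            return docs[index];
        }

        /** Moves to the next document; returns false when the postings are exhausted. */
        boolean next() {
            index++;
            if (index >= docs.length) {
                return false;
            }
            if (lazy) {
                proxPending = true; // remember that the prox pointer is stale
            } else {
                seekProx();         // eager: reposition prox on every move
            }
            return true;
        }

        /** Called only when positions are actually needed for the current doc. */
        void nextPosition() {
            if (proxPending) {
                seekProx();
            }
            // Reading and decoding the position would happen here.
        }

        private void seekProx() {
            prox.seek(index); // toy offset: the ordinal of the posting
            proxPending = false;
        }
    }

    /** Intersects two posting lists; returns {docs with both terms, total prox seeks}. */
    static int[] run(int[] docsA, int[] docsB, boolean lazy) {
        SeekCountingStream proxA = new SeekCountingStream();
        SeekCountingStream proxB = new SeekCountingStream();
        TermCursor a = new TermCursor(docsA, proxA, lazy);
        TermCursor b = new TermCursor(docsB, proxB, lazy);
        int matches = 0;
        boolean more = a.next() && b.next();
        while (more) {
            if (a.doc() == b.doc()) {
                a.nextPosition(); // positions are needed only here
                b.nextPosition();
                matches++;
                more = a.next() && b.next();
            } else if (a.doc() < b.doc()) {
                more = a.next();
            } else {
                more = b.next();
            }
        }
        return new int[] { matches, proxA.seekCount + proxB.seekCount };
    }

    public static void main(String[] args) {
        int[] docsA = { 1, 3, 5, 7, 9 };
        int[] docsB = { 2, 3, 6, 9, 10 };
        int[] eager = run(docsA, docsB, false);
        int[] lazy = run(docsA, docsB, true);
        System.out.println("eager: matches=" + eager[0] + " seeks=" + eager[1]);
        System.out.println("lazy:  matches=" + lazy[0] + " seeks=" + lazy[1]);
    }
}
```

In this toy model the lazy cursor seeks the prox stream once per cursor for
each document that contains both terms, while the eager cursor seeks on every
move in the postings. The numbers it produces are for the made-up doc ids above
and are not comparable to the measurements quoted in the issue.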

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

