Any idea when you're going to post a snapshot of your 675 stuff, Grant? -----Original Message----- From: Grant Ingersoll [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 18, 2006 1:18 PM To: [email protected] Subject: Re: [jira] Commented: (LUCENE-687) Performance improvement: Lazy skipping on proximity file
Can you share your performance test as well as the results? http://issues.apache.org/jira/browse/LUCENE-675 Thanks, Grant On Oct 18, 2006, at 3:41 PM, Michael Busch (JIRA) wrote: > [ http://issues.apache.org/jira/browse/LUCENE-687? > page=comments#action_12443343 ] > > Michael Busch commented on LUCENE-687: > -------------------------------------- > > Hi Yonik, > > thanks for the quick reply! I'm going to do performance tests and > will give you some numbers soon. > >> Performance improvement: Lazy skipping on proximity file >> -------------------------------------------------------- >> >> Key: LUCENE-687 >> URL: http://issues.apache.org/jira/browse/LUCENE-687 >> Project: Lucene - Java >> Issue Type: Improvement >> Components: Index >> Reporter: Michael Busch >> Priority: Minor >> Attachments: lazy_prox_skipping.patch >> >> >> Hello, >> I'm proposing a patch here that changes >> org.apache.lucene.index.SegmentTermPositions to avoid unnecessary >> skips and reads on the proximity stream. Currently a call of next >> () or seek(), which causes a movement to a document in the freq >> file also moves the prox pointer to the posting list of that >> document. But this is only necessary if actual positions have to >> be retrieved for that particular document. >> Consider for example a phrase query with two terms: the freq >> pointer for term 1 has to move to document x to answer the >> question if the term occurs in that document. But *only* if term 2 >> also matches document x, the positions have to be read to figure >> out if term 1 and term 2 appear next to each other in document x >> and thus satisfy the query. >> A move to the posting list of a document can be quite expensive. >> It has to be skipped to the last skip point before that document >> and then the documents between the skip point and the desired >> document have to be scanned, which means that the VInts of all >> positions of those documents have to be read and decoded. >> An improvement is to move the prox pointer lazily to a document >> only if nextPosition() is called. This will become even more >> important in the future when the size of the proximity file >> increases (e. g. by adding payloads to the posting lists). >> My patch implements this lazy skipping. All unit tests pass. >> I also attach a new unit test that works as follows: >> Using a RamDirectory an index is created and test docs are added. >> Then the index is optimized to make sure it only has a single >> segment. This is important, because IndexReader.open() returns an >> instance of SegmentReader if there is only one segment in the >> index. The proxStream instance of SegmentReader is package >> protected, so it is possible to set proxStream to a different >> object. I am using a class called SeeksCountingStream that extends >> IndexInput in a way that it is able to count the number of >> invocations of seek(). >> Then the testcase searches the index using a PhraseQuery "term1 >> term2". It is known how many documents match that query and the >> testcase can verify that seek() on the proxStream is not called >> more often than number of search hits. >> Example: >> Number of docs in the index: 500 >> Number of docs that match the query "term1 term2": 5 >> Invocations of seek on prox stream (old code): 29 >> Invocations of seek on prox stream (patched version): 5 >> - Michael > > -- > This message is automatically generated by JIRA. > - > If you think it was sent incorrectly contact one of the > administrators: http://issues.apache.org/jira/secure/ > Administrators.jspa > - > For more information on JIRA, see: http://www.atlassian.com/ > software/jira > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > -------------------------- Grant Ingersoll Sr. Software Engineer Center for Natural Language Processing Syracuse University 335 Hinds Hall Syracuse, NY 13244 http://www.cnlp.org --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
