Dmitry Serebrennikov wrote:
1) Since I do not need the intermediate terms, it makes sense to try to have a method that skips to the right term without creating the intermediate Term objects. I did a version of this yesterday and ended up seeing a factor of 2 performance increase and a factor of 2 garbage reduction. The patch adds the following method to Term.java:
final int compareTo(String otherField, char[] otherText, int start, int len)
And changes SegmentTermEnum.java to delay creation of Term object until call to term().
Full diff is attached. Any comments are welcome, especially if I've missed something.
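
A minimal sketch of what such an allocation-free comparison could look like (this is an illustration, not the actual patch; the method name mirrors the signature above, but the class, field names, and ordering details are assumptions — Term ordering compares field first, then text):

    // TermCompareSketch.java -- hypothetical sketch of comparing a term's
    // (field, text) against a char[] slice without allocating a Term object.
    public class TermCompareSketch {

        // Compare (field, text) to (otherField, otherText[start..start+len)).
        // Returns <0, 0, or >0, matching Term's ordering: field first, then text.
        static int compareTo(String field, String text,
                             String otherField, char[] otherText,
                             int start, int len) {
            int c = field.compareTo(otherField);
            if (c != 0) return c;
            int limit = Math.min(text.length(), len);
            for (int i = 0; i < limit; i++) {
                char a = text.charAt(i);
                char b = otherText[start + i];
                if (a != b) return a - b;          // first differing char decides
            }
            return text.length() - len;            // shorter text sorts first
        }

        public static void main(String[] args) {
            char[] buf = "xxapplexx".toCharArray();
            System.out.println(compareTo("body", "apple", "body", buf, 2, 5));      // 0
            System.out.println(compareTo("body", "apples", "body", buf, 2, 5) > 0); // true
            System.out.println(compareTo("body", "apple", "title", buf, 2, 5) < 0); // true
        }
    }

The point of the shape is that the enumerator can keep its current term as a raw buffer and only materialize a Term when term() is actually called.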

Looks reasonable to me. Does it still pass all of the unit tests?


3) I found a piece of code in TermInfosReader.java that uses a field, SegmentTermEnum.prev, to try to optimize seeks. It looks like this code was added after the original SegmentTermEnum was finished. I can't find any record of the change in Jakarta's CVS, so it was probably made before the move to Jakarta. Does anyone remember why it is here? Does it actually serve a useful purpose? It seems that the condition this code tests for would not really occur. Perhaps I'm missing something. Here's the code fragment that uses the .prev field:

  /** Returns the TermInfo for a Term in the set, or null. */
  final synchronized TermInfo get(Term term) throws IOException {
    if (size == 0) return null;

    // optimize sequential access: first try scanning cached enum w/o seeking
    if (enum.term() != null                       // term is at or past current
        && ((enum.prev != null && term.compareTo(enum.prev) > 0)
            || term.compareTo(enum.term()) >= 0)) {
      int enumOffset = (enum.position/TermInfosWriter.INDEX_INTERVAL)+1;
      if (indexTerms.length == enumOffset         // but before end of block
          || term.compareTo(indexTerms[enumOffset]) < 0)
        return scanEnum(term);                    // no need to seek
    }

    // random-access: must seek
    seekEnum(getIndexOffset(term));
    return scanEnum(term);
  }

If you put a print statement in this and run the unit tests, you'll see that this optimization fires a lot. If, e.g., one expands a wildcarded string into a bunch of terms that are near one another in the enum, then subsequently asks for the frequency of each term (to weight it in a query), and then, in a third pass, asks for its TermDocs, each of these latter passes benefits from this optimization. So let's not lose it.
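
To make the decision in get() concrete, here is a small standalone sketch of the same check using plain Strings in place of Terms (the class, names, and the sample index values are invented for illustration; the logic mirrors the fragment above: scan forward only when the target is at or past the enum's position but before the next indexed term):

    // SeekOrScanSketch.java -- hypothetical model of the sequential-access
    // check in TermInfosReader.get(), with Strings standing in for Terms.
    public class SeekOrScanSketch {

        // indexTerms holds every indexInterval-th term, sorted.
        // position is the enum's absolute position in the term dictionary.
        // Returns true when a forward scan suffices (no seek needed).
        static boolean canScan(String target, String current, String prev,
                               String[] indexTerms, int position,
                               int indexInterval) {
            if (current == null) return false;
            // target is at or past the enum's current term
            boolean atOrPast = (prev != null && target.compareTo(prev) > 0)
                || target.compareTo(current) >= 0;
            if (!atOrPast) return false;
            // ...but still before the next index entry (or past the last one)
            int enumOffset = position / indexInterval + 1;
            return enumOffset == indexTerms.length
                || target.compareTo(indexTerms[enumOffset]) < 0;
        }

        public static void main(String[] args) {
            String[] index = {"apple", "melon", "zebra"};
            // enum sits at "mango" (position 100, interval 128 -> offset 1)
            System.out.println(canScan("melo", "mango", "lemon", index, 100, 128));  // true: scan
            System.out.println(canScan("zoo",  "mango", "lemon", index, 100, 128));  // false: seek
            System.out.println(canScan("kiwi", "mango", "lemon", index, 100, 128));  // false: behind
        }
    }

A run of nearby lookups in sorted order (the wildcard-expansion case) keeps hitting the "true" branch, which is exactly why the optimization fires so often in the unit tests.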


Doug




