[ http://issues.apache.org/jira/browse/LUCENE-506?page=all ]
Steven Tamm updated LUCENE-506:
-------------------------------

    Attachment: Prefetching.patch

This also includes two additional test cases. Prefetching is exposed publicly only through IndexReader.open(Directory, Collection). If you try to query a term that wasn't included, ... This also includes some wildcard handling. I'm not sure it's absolutely necessary for WildcardTermEnum or FuzzyTermEnum. You can probably remove the entire if (entry == null) block of PrefetchedTermInfosReader.seekEnum, but keeping it provides more flexibility.

> Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you know the queries ahead of time)
> -------------------------------------------------------------------------------------------------------------
>
>          Key: LUCENE-506
>          URL: http://issues.apache.org/jira/browse/LUCENE-506
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Index
>     Versions: 2.0
>  Environment: Patch against Lucene 1.9 trunk as of Mar 1 06
>     Reporter: Steven Tamm
>  Attachments: Prefetching.patch
>
> Summary: Provide a way to avoid loading the TermInfoIndex into memory if you
> know all the terms you are ever going to query.
>
> In our search environment, we have a large number of indexes (many
> thousands), any of which may be queried by any number of hosts. These
> indexes may be very large (~1M documents), but since we have a low term/doc
> ratio, we have 7-11M terms. With an index interval of 128, that means
> ~70-90K terms in the TermInfoIndex. On loading the index, Lucene instantiates
> a Term, a TermInfo, a String, and a char[] for each of them. When the index
> is long-lived, this makes some sense because you can quickly search the list
> of terms using binary search. However, since we throw away the indexes very
> often, a lot of garbage is created per query.
>
> Here's an example where we load a large index 10 times. This corresponds to
> 7MB of garbage per query.
>
>          percent          live           alloc'ed       stack  class
>  rank   self   accum     bytes    objs     bytes   objs  trace  name
>     1  4.48%   4.48%   4678736  128946  23393680 644730 387749  char[]
>     3  3.95%  12.61%   4126272  128946  20631360 644730 387751  org.apache.lucene.index.TermInfo
>     6  2.96%  22.71%   3094704  128946  15473520 644730 387748  java.lang.String
>     8  1.98%  26.97%   2063136  128946  10315680 644730 387750  org.apache.lucene.index.Term
>
> This adds up after a while. Since we know exactly which Terms we're going to
> search for before even opening the index, there's no need to allocate this
> much memory. Upon opening the index, we can go through the TII in sequential
> order and retrieve the entries into the main term dictionary and reduce the
> storage requirements dramatically. When you make only one query per index,
> this cuts the garbage generated by querying by about 60% and increases
> throughput by 77%.
>
> This is accomplished by factoring out the "index loading" aspects of
> TermInfosReader into a new file, SegmentTermInfosReader. TermInfosReader
> becomes a base class to allow access to terms. A new class,
> PrefetchedTermInfosReader, will, upon startup, sort the passed-in terms and
> retrieve the IndexEntries for those terms. IndexReader and SegmentReader
> gain new constructor methods that take a Collection of Terms corresponding
> to the complete set of terms that will ever be searched during the life of
> the index.
>
> In order to support the "skipping" behavior, some changes need to be made to
> SegmentTermEnum: specifically, we need to be able to go back an entry in
> order to retrieve the previous TermInfo and IndexPointer. This is because,
> unlike the normal case, with the index we want to return the value right
> before the intended term (so that we can be behind the desired term in the
> main dictionary). For example, if we're looking for "apple" in the index,
> and the two adjacent values are "abba" and "argon", we want to return "abba"
> instead of "argon".
> That way we won't miss any terms in the real index.
>
> This code is confusing; it should probably be moved to a subclass of
> TermBuffer, but that required more code. Not wanting to modify TermBuffer,
> in order to keep it small, also led to the odd NPE catch in
> SegmentTermEnum.java. Sticklers for contracts may want to rename
> SegmentTermEnum.skipTo(), because it implements a different contract, but it
> would be useful for anyone trying to skip around in the TII, so I figured it
> was the right thing to do.
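
The "return the entry just before the desired term" behavior described above can be sketched roughly as follows. This is only an illustration, not code from Prefetching.patch: plain Strings stand in for Lucene Terms, the class name PrefetchSketch and the method prefetchFloors are hypothetical, and it uses newer Java collection helpers (List.of) for brevity. It sorts the wanted terms, walks the sorted index entries once, and records for each wanted term the last index entry that is less than or equal to it.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Hypothetical illustration only; these are not the classes from the patch. */
final class PrefetchSketch {

    /**
     * Walks the sorted index terms once and, for each wanted term, records the
     * position of the last index term that is <= the wanted term (its "floor").
     * A value of -1 means the wanted term sorts before every index term.
     */
    static Map<String, Integer> prefetchFloors(List<String> sortedIndexTerms,
                                               Collection<String> wantedTerms) {
        Map<String, Integer> floors = new LinkedHashMap<>();
        int pos = -1; // position of the current floor candidate
        // TreeSet sorts and de-duplicates the wanted terms, so one forward
        // pass over the index is enough.
        for (String wanted : new TreeSet<>(wantedTerms)) {
            // Advance while the next index term is still <= the wanted term.
            while (pos + 1 < sortedIndexTerms.size()
                    && sortedIndexTerms.get(pos + 1).compareTo(wanted) <= 0) {
                pos++;
            }
            floors.put(wanted, pos);
        }
        return floors;
    }

    public static void main(String[] args) {
        // Assumed sample data: every entry here plays the role of a TII term.
        List<String> index = List.of("abba", "argon", "banana");
        Map<String, Integer> floors =
                prefetchFloors(index, List.of("apple", "azure", "aardvark"));
        for (Map.Entry<String, Integer> e : floors.entrySet()) {
            int p = e.getValue();
            String start = (p < 0) ? "(start of dictionary)" : index.get(p);
            System.out.println(e.getKey() + " -> " + start);
        }
    }
}
```

With the sample data, "apple" maps to "abba" rather than "argon", so a scan of the main dictionary that starts there cannot skip past the term being looked up.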