[
https://issues.apache.org/jira/browse/LUCENE-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shai Erera closed LUCENE-506.
-----------------------------
Resolution: Fixed
IndexReader.open allows passing termInfosDivisor=-1 and you can set the same on
IndexWriterConfig to prevent loading the term infos.
> Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you
> know the queries ahead of time)
> -------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-506
> URL: https://issues.apache.org/jira/browse/LUCENE-506
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.0.0
> Environment: Patch against Lucene 1.9 trunk as of Mar 1 06
> Reporter: Steven Tamm
> Priority: Minor
> Attachments: Prefetching.patch
>
>
> Summary: Provide a way to avoid loading the TermInfoIndex into memory if you
> know all the terms you are ever going to query.
> In our search environment, we have a large number of indexes (many
> thousands), any of which may be queried by any number of hosts. These
> indexes may be very large (~1M documents), but since we have a low term/doc
> ratio, we have 7-11M terms. With an index interval of 128, that means
> ~70-90K index terms. On loading the index, it instantiates a Term, a
> TermInfo, a String, and a char[] for each of them. When the index is
> long-lived, this makes some sense because you can quickly search the list
> of terms using binary search. However, since we throw away the indexes very
> often, a lot of garbage is created per query.
> Here's an example where we load a large index 10 times. This corresponds to
> 7MB of garbage per query.
>         percent           live           alloc'ed       stack class
> rank   self   accum    bytes    objs     bytes    objs  trace name
>    1  4.48%   4.48%  4678736  128946  23393680  644730 387749 char[]
>    3  3.95%  12.61%  4126272  128946  20631360  644730 387751 org.apache.lucene.index.TermInfo
>    6  2.96%  22.71%  3094704  128946  15473520  644730 387748 java.lang.String
>    8  1.98%  26.97%  2063136  128946  10315680  644730 387750 org.apache.lucene.index.Term
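
As a rough cross-check of the profile above, the figures can be added up in a few lines of Java. This is only back-of-the-envelope arithmetic on the numbers quoted in the report; the class and method names are made up for illustration, and the four listed classes may not be the only allocations involved.

```java
// Back-of-the-envelope check of the profile figures quoted in the report.
final class ProfileArithmetic {
    static final int LOADS = 10;            // the index was loaded 10 times
    static final int INDEX_INTERVAL = 128;  // index interval from the report

    // alloc'ed bytes for char[], TermInfo, String and Term, in that order
    static final long[] ALLOC_BYTES = {23393680L, 20631360L, 15473520L, 10315680L};
    // alloc'ed object count, identical for each of the four classes
    static final long ALLOC_OBJS_PER_CLASS = 644730L;

    static long totalAllocBytes() {
        long sum = 0;
        for (long b : ALLOC_BYTES) {
            sum += b;
        }
        return sum;
    }

    // Bytes allocated by these four classes for a single index load.
    static long bytesPerLoad() {
        return totalAllocBytes() / LOADS;
    }

    // Index entries instantiated for a single index load.
    static long entriesPerLoad() {
        return ALLOC_OBJS_PER_CLASS / LOADS;
    }

    // Approximate number of terms in the main dictionary implied by that count.
    static long impliedTerms() {
        return entriesPerLoad() * INDEX_INTERVAL;
    }

    public static void main(String[] args) {
        System.out.println("bytes per load:   " + bytesPerLoad());   // 6981424
        System.out.println("entries per load: " + entriesPerLoad()); // 64473
        System.out.println("implied terms:    " + impliedTerms());   // 8252544
    }
}
```

So the four classes account for just under 7MB per load, and the entry count implies roughly 8.25M terms, which sits inside the 7-11M range given earlier.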
> This adds up after a while. Since we know exactly which Terms we're going to
> search for before even opening the index, there's no need to allocate this
> much memory. Upon opening the index, we can go through the TII in sequential
> order, keep only the entries that point into the main term dictionary for
> those Terms, and reduce the storage requirements dramatically. If you only
> make 1 query/index, this reduces the amount of garbage generated by querying
> by about 60% and increases throughput by 77%.
> This is accomplished by factoring out the "index loading" aspects of
> TermInfosReader into a new file, SegmentTermInfosReader. TermInfosReader
> becomes a base class to allow access to terms. A new class,
> PrefetchedTermInfosReader, will, upon startup, sort the passed-in terms and
> retrieve the IndexEntries for those terms. IndexReader and SegmentReader are
> modified to add new constructors that take a Collection of Terms
> corresponding to the total set of terms that will ever be searched in the
> life of the index.
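
The prefetch step described above could look roughly like the following. This is a minimal sketch and not the code in Prefetching.patch: IndexEntry and PrefetchSketch are hypothetical stand-ins, terms are plain Strings rather than Lucene Term objects (so the field name is ignored), and the index is an in-memory List instead of the TII file.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class PrefetchSketch {
    /** Hypothetical stand-in for one TII entry: an indexed term and its pointer into the main dictionary. */
    static final class IndexEntry {
        final String term;
        final long pointer;

        IndexEntry(String term, long pointer) {
            this.term = term;
            this.pointer = pointer;
        }
    }

    /**
     * Makes one forward pass over the sorted index entries. For each query
     * term, keeps the last entry whose term is <= the query term, or null if
     * the query term sorts before the first entry (a real reader would then
     * scan from the start of the main dictionary).
     */
    static Map<String, IndexEntry> prefetch(List<IndexEntry> sortedIndex,
                                            Collection<String> queryTerms) {
        Map<String, IndexEntry> kept = new LinkedHashMap<>();
        int i = 0;
        IndexEntry previous = null;
        // TreeSet sorts and de-duplicates the passed-in terms.
        for (String query : new TreeSet<>(queryTerms)) {
            while (i < sortedIndex.size()
                    && sortedIndex.get(i).term.compareTo(query) <= 0) {
                previous = sortedIndex.get(i);
                i++;
            }
            kept.put(query, previous);
        }
        return kept;
    }
}
```

Because both the index and the query terms are visited in sorted order, the index is read once from front to back, and only one retained entry per query term stays reachable afterwards.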
> In order to support the "skipping" behavior, some changes need to be made to
> SegmentTermEnum: specifically, we need to be able to go back an entry in
> order to retrieve the previous TermInfo and IndexPointer. This is because,
> unlike the normal case, with the index we want to return the entry right
> before the intended term (so that we can be behind the desired term in the
> main dictionary). For example, if we're looking for "apple" in the index,
> and the two adjacent values are "abba" and "argon", we want to return "abba"
> instead of "argon". That way we won't miss any terms in the real index.
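
To illustrate the "go back an entry" behavior, here is a small sketch of an enumerator that remembers the entry before its cursor. It is not SegmentTermEnum: LookbackTermEnum is a hypothetical class over plain Strings, and returning the entry itself on an exact match is an assumption made for the sketch.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class LookbackTermEnum {
    private final Iterator<String> terms; // sorted index terms
    private String previous;              // entry just before the cursor, or null
    private String current;               // entry under the cursor, or null when exhausted

    LookbackTermEnum(List<String> sortedTerms) {
        this.terms = sortedTerms.iterator();
        this.current = terms.hasNext() ? terms.next() : null;
    }

    /**
     * Advances until the cursor is past the target, then returns the entry
     * before the cursor: the last term <= target, or null if there is none.
     * Targets must be passed in non-decreasing order, since the cursor only
     * moves forward.
     */
    String skipTo(String target) {
        while (current != null && current.compareTo(target) <= 0) {
            previous = current;
            current = terms.hasNext() ? terms.next() : null;
        }
        return previous;
    }

    public static void main(String[] args) {
        LookbackTermEnum e = new LookbackTermEnum(Arrays.asList("abba", "argon", "banana"));
        System.out.println(e.skipTo("apple")); // abba
    }
}
```

A conventional skipTo would stop at "argon", the first term at or after "apple"; starting a scan of the main dictionary from there would skip over "apple" itself.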
> This code is confusing; it should probably be moved to a subclass of
> TermBuffer, but that required more code. Not wanting to modify TermBuffer,
> to keep it small, also led to the odd NPE catch in SegmentTermEnum.java.
> Sticklers for contracts may want to rename SegmentTermEnum.skipTo() because
> it implements a different contract, but it would be useful for anyone
> trying to skip around in the TII, so I figured it was the right thing to do.