[jira] Closed: (LUCENE-506) Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you know the queries ahead of time)

Shai Erera (JIRA) Wed, 26 Jan 2011 06:19:18 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shai Erera closed LUCENE-506.
-----------------------------

    Resolution: Fixed

IndexReader.open allows passing termInfosIndexDivisor=-1, and you can set the 
same on IndexWriterConfig, to prevent loading the term infos index.

> Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you 
> know the queries ahead of time)
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-506
>                 URL: https://issues.apache.org/jira/browse/LUCENE-506
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.0.0
>         Environment: Patch against Lucene 1.9 trunk as of Mar 1 06
>            Reporter: Steven Tamm
>            Priority: Minor
>         Attachments: Prefetching.patch
>
>
> Summary: Provide a way to avoid loading the TermInfoIndex into memory if you 
> know all the terms you are ever going to query.
> In our search environment, we have a large number of indexes (many 
> thousands), any of which may be queried by any number of hosts.  These 
> indexes may be very large (~1M documents), but since we have a low term/doc 
> ratio, we have 7-11M terms.  With an index interval of 128, that means 
> ~70-90K index terms.  On loading the index, it instantiates a Term, a 
> TermInfo, a String, and a char[] for each of them.  When the index is long 
> lived, this makes some sense because you can quickly search the list of 
> terms using binary search.  However, since we throw away the indexes very 
> often, a lot of garbage is created per query.
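
A note on the mechanism described above: the in-memory term index keeps only 
every Nth term of the sorted dictionary, so a lookup is a binary search over 
the sampled terms followed by a short scan. The following is a minimal sketch 
of that idea, not Lucene's TermInfosReader. The class SampledTermIndex and its 
methods are hypothetical names, and a List stands in for the on-disk 
dictionary.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of a sampled term index (hypothetical; not Lucene code).
class SampledTermIndex {
    private final List<String> allTerms;   // stands in for the on-disk term dictionary
    private final List<String> indexTerms; // every interval-th term, held in memory
    private final int interval;

    SampledTermIndex(List<String> sortedTerms, int interval) {
        this.allTerms = sortedTerms;
        this.interval = interval;
        this.indexTerms = new ArrayList<>();
        for (int i = 0; i < sortedTerms.size(); i += interval) {
            indexTerms.add(sortedTerms.get(i));
        }
    }

    // Number of entries that must be instantiated when the index is loaded.
    int indexSize() {
        return indexTerms.size();
    }

    // Returns the position of term in the dictionary, or -1 if it is absent.
    int lookup(String term) {
        int pos = Collections.binarySearch(indexTerms, term);
        if (pos < 0) {
            pos = -pos - 2; // last sampled term that sorts before 'term'
        }
        if (pos < 0) {
            return -1; // 'term' sorts before the first dictionary term
        }
        int start = pos * interval;
        int end = Math.min(start + interval, allTerms.size());
        for (int i = start; i < end; i++) { // scan at most 'interval' terms
            int cmp = allTerms.get(i).compareTo(term);
            if (cmp == 0) return i;
            if (cmp > 0) break;
        }
        return -1;
    }
}
```

With an interval of 128, the in-memory structure holds about 1/128 of the 
terms, which is where the tens of thousands of entries per index load come 
from.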
> Here's an example where we load a large index 10 times.  This corresponds to 
> 7MB of garbage per query.
>           percent          live          alloc'ed  stack class
>  rank   self  accum     bytes objs     bytes  objs trace name
>     1  4.48%  4.48%   4678736 128946  23393680 644730 387749 char[]
>     3  3.95% 12.61%   4126272 128946  20631360 644730 387751 org.apache.lucene.index.TermInfo
>     6  2.96% 22.71%   3094704 128946  15473520 644730 387748 java.lang.String
>     8  1.98% 26.97%   2063136 128946  10315680 644730 387750 org.apache.lucene.index.Term
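
The 7MB figure can be checked from the alloc'ed bytes column of the profile 
above. The sketch below only sums the four rows shown and divides by the 10 
loads mentioned in the description. The class name GarbagePerLoad is made up 
for illustration, and rows not shown in the profile are ignored.

```java
// Sums the "alloc'ed bytes" column of the four profile rows and divides by
// the number of index loads (10, per the description).
class GarbagePerLoad {
    static final long[] ALLOCATED_BYTES = {
        23_393_680L, // char[]
        20_631_360L, // org.apache.lucene.index.TermInfo
        15_473_520L, // java.lang.String
        10_315_680L  // org.apache.lucene.index.Term
    };
    static final int LOADS = 10;

    static long totalAllocated() {
        long sum = 0;
        for (long bytes : ALLOCATED_BYTES) {
            sum += bytes;
        }
        return sum;
    }

    static long perLoad() {
        return totalAllocated() / LOADS;
    }

    public static void main(String[] args) {
        System.out.println(totalAllocated() + " bytes allocated in total"); // 69814240
        System.out.println(perLoad() + " bytes per load");                  // 6981424
    }
}
```

That is just under 7 million bytes per load for these four classes alone.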
> This adds up after a while.  Since we know exactly which Terms we're going to 
> search for before even opening the index, there's no need to allocate this 
> much memory.  Upon opening the index, we can go through the TII in sequential 
> order, retrieve only the entries into the main term dictionary that we need, 
> and reduce the storage requirements dramatically.  If you make only 1 
> query/index, this reduces the amount of garbage generated by querying by 
> about 60% and increases throughput by 77%.
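
The sequential pass described above can be sketched as a single merge over 
the sorted index entries and the sorted query terms, keeping for each query 
term only the last index entry that sorts at or before it. This is an 
illustration under assumptions, not the code in Prefetching.patch: 
PrefetchSketch, IndexEntry, and prefetch are hypothetical names, and a List 
stands in for the TII file.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of prefetching index entries for a known set of terms.
class PrefetchSketch {
    // Stand-in for one term index entry: a term plus an offset into the
    // main term dictionary.
    static final class IndexEntry {
        final String term;
        final long pointer;

        IndexEntry(String term, long pointer) {
            this.term = term;
            this.pointer = pointer;
        }
    }

    // One forward pass over sortedIndex; retains only entries needed to
    // locate the given query terms.
    static List<IndexEntry> prefetch(List<IndexEntry> sortedIndex,
                                     Collection<String> queryTerms) {
        List<IndexEntry> kept = new ArrayList<>();
        if (sortedIndex.isEmpty()) {
            return kept;
        }
        int i = 0;
        IndexEntry lastKept = null;
        for (String q : new TreeSet<>(queryTerms)) { // sort and de-duplicate
            // Advance to the last entry whose term is <= q.
            while (i + 1 < sortedIndex.size()
                    && sortedIndex.get(i + 1).term.compareTo(q) <= 0) {
                i++;
            }
            IndexEntry e = sortedIndex.get(i);
            if (e.term.compareTo(q) > 0) {
                continue; // q sorts before the first index entry
            }
            if (e != lastKept) { // several query terms may share one entry
                kept.add(e);
                lastKept = e;
            }
        }
        return kept;
    }
}
```

Because both inputs are sorted, the index is read once front to back, and 
the retained list is never larger than the number of distinct query terms.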
> This is accomplished by factoring out the "index loading" aspects of 
> TermInfosReader into a new file, SegmentTermInfosReader.  TermInfosReader 
> becomes a base class to allow access to terms.  A new class, 
> PrefetchedTermInfosReader, will, upon startup, sort the passed-in terms and 
> retrieve the IndexEntries for those terms.  IndexReader and SegmentReader 
> gain new constructors that take a Collection of Terms corresponding to the 
> total set of terms that will ever be searched in the life of the index.
> In order to support the "skipping" behavior, some changes need to be made to 
> SegmentTermEnum: specifically, we need to be able to go back an entry in 
> order to retrieve the previous TermInfo and IndexPointer.  This is because, 
> unlike the normal case, with the index we want to return the value right 
> before the intended field (so that we can be behind the desired term in the 
> main dictionary).  For example, if we're looking for "apple" in the index, 
> and the two adjacent values are "abba" and "argon", we want to return "abba" 
> instead of "argon".  That way we won't miss any terms in the real index.  
> This code is confusing; it should probably be moved to a subclass of 
> TermBuffer, but that required more code.  Not wanting to modify TermBuffer, 
> in order to keep it small, also led to the odd NPE catch in 
> SegmentTermEnum.java.  Sticklers for contracts may want to rename 
> SegmentTermEnum.skipTo() to a different name because it implements a 
> different contract, but it would be useful for anyone trying to skip around 
> in the TII, so I figured it was the right thing to do.
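
The "abba" versus "argon" example is the difference between a floor lookup 
(greatest entry at or before the target) and a ceiling lookup (first entry at 
or after the target). A minimal sketch using java.util.TreeMap as a stand-in 
for the term index follows. The class FloorVsCeiling and the pointer values 
are invented for illustration, and nothing here is taken from 
SegmentTermEnum.

```java
import java.util.TreeMap;

// Contrasts the two seek contracts on a sorted map standing in for the TII.
class FloorVsCeiling {
    // Builds a tiny index: term -> hypothetical offset into the main dictionary.
    static TreeMap<String, Long> sampleIndex() {
        TreeMap<String, Long> index = new TreeMap<>();
        index.put("abba", 0L);
        index.put("argon", 128L);
        return index;
    }

    // Entry to start scanning from so that no dictionary term is skipped.
    static String scanStart(TreeMap<String, Long> index, String target) {
        return index.floorKey(target);
    }

    // Entry a "first term >= target" contract would return instead.
    static String firstAtOrAfter(TreeMap<String, Long> index, String target) {
        return index.ceilingKey(target);
    }

    public static void main(String[] args) {
        TreeMap<String, Long> index = sampleIndex();
        System.out.println(scanStart(index, "apple"));      // abba
        System.out.println(firstAtOrAfter(index, "apple")); // argon
    }
}
```

Starting the scan of the main dictionary at "argon" would skip "apple", 
which is why the floor entry is the one to return here.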
