[jira] Created: (LUCENE-506) Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you know the queries ahead of time)

Steven Tamm (JIRA) Wed, 01 Mar 2006 16:11:03 -0800

Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you 
know the queries ahead of time)
-------------------------------------------------------------------------------------------------------------


         Key: LUCENE-506
         URL: http://issues.apache.org/jira/browse/LUCENE-506
     Project: Lucene - Java
        Type: Improvement
  Components: Index  
    Versions: 2.0    
 Environment: Patch against Lucene 1.9 trunk as of Mar 1 06
    Reporter: Steven Tamm


Summary: Provide a way to avoid loading the TermInfoIndex into memory if you 
know all the terms you are ever going to query.

In our search environment, we have a large number of indexes (many thousands), 
any of which may be queried by any number of hosts.  These indexes may be very 
large (~1M documents), but since we have a low term/doc ratio, we have 7-11M 
terms.  With an index interval of 128, that means ~70-90K index terms are 
loaded into memory.  On loading the index, Lucene instantiates a Term, a 
TermInfo, a String, and a char[] for each of them.  When the index is long 
lived, this makes some sense because you can quickly search the list of terms 
using binary search.  However, since we throw away the indexes very often, a 
lot of garbage is created per query.

Here's an example where we load a large index 10 times.  This corresponds to 
7MB of garbage per query.
          percent          live          alloc'ed  stack class
 rank   self  accum     bytes objs     bytes  objs trace name
    1  4.48%  4.48%   4678736 128946  23393680 644730 387749 char[]
    3  3.95% 12.61%   4126272 128946  20631360 644730 387751 org.apache.lucene.index.TermInfo
    6  2.96% 22.71%   3094704 128946  15473520 644730 387748 java.lang.String
    8  1.98% 26.97%   2063136 128946  10315680 644730 387750 org.apache.lucene.index.Term
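
The "7MB per query" figure can be checked from the alloc'ed columns above.  The 
following is only a small arithmetic sketch; the class and method names are 
made up for illustration and are not part of Lucene or the patch.

```java
// Hypothetical helper, not part of Lucene: arithmetic on the profile above.
class HprofArithmetic {
    // Average bytes allocated per index load, given the total bytes
    // allocated per class over `loads` loads of the index.
    static long bytesPerLoad(long[] allocatedBytes, int loads) {
        long total = 0;
        for (long b : allocatedBytes) {
            total += b;
        }
        return total / loads;
    }

    // Average number of objects of one class allocated per index load.
    static long objectsPerLoad(long allocatedObjects, int loads) {
        return allocatedObjects / loads;
    }

    public static void main(String[] args) {
        // alloc'ed bytes for char[], TermInfo, String and Term (10 loads).
        long[] alloc = {23393680L, 20631360L, 15473520L, 10315680L};
        System.out.println(bytesPerLoad(alloc, 10));      // 6981424, about 7MB
        System.out.println(objectsPerLoad(644730L, 10));  // 64473 per class
    }
}
```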

This adds up after a while.  Since we know exactly which Terms we're going to 
search for before even opening the index, there's no need to allocate this much 
memory.  Upon opening the index, we can go through the TII in sequential order 
and retrieve only the entries into the main term dictionary that those Terms 
need, which reduces the storage requirements dramatically.  If you only make 
1 query/index, this reduces the amount of garbage generated by querying by 
about 60% and increases throughput by 77%.
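
As a rough illustration of the idea (not the code in the patch), the sketch 
below walks a sorted list of index terms once and keeps, for each query term, 
only the last index entry that sorts at or before it.  Plain Strings stand in 
for Lucene Terms, and PrefetchSketch and entriesToKeep are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch, not Lucene code: choose which index entries to retain.
class PrefetchSketch {
    // Returns positions in sortedIndexTerms of the entries worth keeping:
    // for each query term, the last index entry that is <= that term.
    // Position 0 is kept when a query term sorts before every index entry,
    // meaning "scan the main dictionary from its start".
    static List<Integer> entriesToKeep(List<String> sortedIndexTerms,
                                       Collection<String> queryTerms) {
        List<Integer> keep = new ArrayList<>();
        if (sortedIndexTerms.isEmpty()) {
            return keep;
        }
        int pos = 0;
        // Sorting the query terms lets a single forward pass serve them all.
        for (String q : new TreeSet<>(queryTerms)) {
            while (pos + 1 < sortedIndexTerms.size()
                    && sortedIndexTerms.get(pos + 1).compareTo(q) <= 0) {
                pos++;
            }
            // Several query terms may share the same index entry.
            if (keep.isEmpty() || keep.get(keep.size() - 1) != pos) {
                keep.add(pos);
            }
        }
        return keep;
    }
}
```

With index terms "", "abba", "argon", "melon", "zebra" and query terms "apple", 
"banana", "apricot", only positions 1 ("abba") and 2 ("argon") are retained; 
the other entries never need to be materialized.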

This is accomplished by factoring out the "index loading" aspects of 
TermInfosReader into a new class, SegmentTermInfosReader.  TermInfosReader 
becomes a base class to allow access to terms.  A new class, 
PrefetchedTermInfosReader, will, upon startup, sort the passed-in terms and 
retrieve the IndexEntries for those terms.  IndexReader and SegmentReader gain 
new constructors that take a Collection of Terms corresponding to the total 
set of terms that will ever be searched in the life of the index.

In order to support the "skipping" behavior, some changes need to be made to 
SegmentTermEnum: specifically, we need to be able to go back an entry in order 
to retrieve the previous TermInfo and IndexPointer.  This is because, unlike 
the normal case, with the index we want to return the value right before the 
intended term (so that we end up behind the desired term in the main 
dictionary).  For example, if we're looking for "apple" in the index, and the 
two adjacent values are "abba" and "argon", we want to return "abba" instead 
of "argon".  That way we won't miss any terms in the real index.  This code is 
confusing; it should probably be moved to a subclass of TermBuffer, but that 
required more code.  Not wanting to modify TermBuffer, in order to keep it 
small, also led to the odd NPE catch in SegmentTermEnum.java.  Sticklers for 
contracts may want to rename SegmentTermEnum.skipTo() because it implements a 
different contract, but it would be useful for anyone trying to skip around in 
the TII, so I figured it was the right thing to do.
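
The "return the entry before the target" behavior is a floor lookup.  The 
sketch below shows it with a binary search over a sorted String array; it is 
only an illustration of the contract described above, and FloorLookupSketch 
and floorIndex are hypothetical names, not part of the patch.

```java
import java.util.Arrays;

// Hypothetical sketch, not Lucene code: find the index entry at or just
// before a target term, so that a forward scan of the main dictionary
// starting from that entry cannot skip past the target.
class FloorLookupSketch {
    // Returns the position of the greatest entry <= target,
    // or -1 if every entry sorts after the target.
    static int floorIndex(String[] sortedIndexTerms, String target) {
        int r = Arrays.binarySearch(sortedIndexTerms, target);
        if (r >= 0) {
            return r;               // exact match: start at that entry
        }
        int insertionPoint = -(r + 1);
        return insertionPoint - 1;  // entry just before the target
    }

    public static void main(String[] args) {
        String[] index = {"abba", "argon"};
        // "apple" sorts between the two entries, so the lookup yields "abba".
        System.out.println(index[floorIndex(index, "apple")]);
    }
}
```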

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
