[jira] Updated: (LUCENE-506) Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you know the queries ahead of time)

Steven Tamm (JIRA) Wed, 01 Mar 2006 16:15:03 -0800

     [ http://issues.apache.org/jira/browse/LUCENE-506?page=all ]


Steven Tamm updated LUCENE-506:
-------------------------------

    Attachment: Prefetching.patch

This also includes two additional test cases.  The public exposure to the 
prefetching is controlled solely by IndexReader.open(Directory,Collection).  If 
you try to query a term that wasn't included, 

This also includes some wildcard handling.  I'm not sure it's absolutely 
necessary for WildcardTermEnum or FuzzyTermEnum.  You can probably remove the 
entire if (entry == null) block of PrefetchedTermInfosReader.seekEnum.  But 
this provides more flexibility.

> Optimize Memory Use for Short-Lived Indexes (Do not load TermInfoIndex if you 
> know the queries ahead of time)
> -------------------------------------------------------------------------------------------------------------
>
>          Key: LUCENE-506
>          URL: http://issues.apache.org/jira/browse/LUCENE-506
>      Project: Lucene - Java
>         Type: Improvement
>   Components: Index
>     Versions: 2.0
>  Environment: Patch against Lucene 1.9 trunk as of Mar 1 06
>     Reporter: Steven Tamm
>  Attachments: Prefetching.patch
>
> Summary: Provide a way to avoid loading the TermInfoIndex into memory if you 
> know all the terms you are ever going to query.
> In our search environment, we have a large number of indexes (many 
> thousands), any of which may be queried by any number of hosts.  These 
> indexes may be very large (~1M documents), and even with a low term/doc 
> ratio they hold 7-11M terms.  With an index interval of 128, that means 
> ~70-90K index terms.  On loading the index, Lucene instantiates a Term, a 
> TermInfo, a String, and a char[] for each of them.  When the index is long 
> lived, this makes some sense because you can quickly search the list of 
> terms using binary search.  However, since we throw away the indexes very 
> often, a lot of garbage is created per query.
> Here's an example where we load a large index 10 times.  This corresponds to 
> 7MB of garbage per query.
>           percent          live          alloc'ed  stack class
>  rank   self  accum     bytes objs     bytes  objs trace name
>     1  4.48%  4.48%   4678736 128946  23393680 644730 387749 char[]
>     3  3.95% 12.61%   4126272 128946  20631360 644730 387751 org.apache.lucene.index.TermInfo
>     6  2.96% 22.71%   3094704 128946  15473520 644730 387748 java.lang.String
>     8  1.98% 26.97%   2063136 128946  10315680 644730 387750 org.apache.lucene.index.Term
> This adds up after a while.  Since we know exactly which Terms we're going to 
> search for before even opening the index, there's no need to allocate this 
> much memory.  Upon opening the index, we can go through the TII in sequential 
> order, retrieve only the entries we need into the main term dictionary, and 
> reduce the storage requirements dramatically.  If you only make 1 
> query/index, this reduces the amount of garbage generated by querying by 
> about 60% and increases throughput by 77%.
> This is accomplished by factoring out the "index loading" aspects of 
> TermInfosReader into a new file, SegmentTermInfosReader.  TermInfosReader 
> becomes a base class to allow access to terms.  A new class, 
> PrefetchedTermInfosReader, will, upon startup, sort the passed-in terms and 
> retrieve the IndexEntries for those terms.  IndexReader and SegmentReader 
> gain new constructor methods that take a Collection of Terms corresponding 
> to the total set of terms that will ever be searched in the life of the 
> index.
> In order to support the "skipping" behavior, some changes need to be made to 
> SegmentTermEnum: specifically, we need to be able to go back an entry in 
> order to retrieve the previous TermInfo and IndexPointer.  This is because, 
> unlike the normal case, when seeking in the index we want to return the 
> entry right before the intended one (so that we end up behind the desired 
> term in the main dictionary).  For example, if we're looking for "apple" in 
> the index, and the two adjacent values are "abba" and "argon", we want to 
> return "abba" instead of "argon".  That way we won't miss any terms in the 
> real index.  This code is confusing; it should probably be moved to a 
> subclass of TermBuffer, but that required more code.  Not wanting to modify 
> TermBuffer, in order to keep it small, also led to the odd NPE catch in 
> SegmentTermEnum.java.  Sticklers for contracts may want to rename 
> SegmentTermEnum.skipTo() because it implements a different contract, but it 
> would be useful for anyone trying to skip around in the TII, so I figured it 
> was the right thing to do.
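
For illustration only, the "return the entry right before the target" lookup 
and the sorted single pass described in the quoted text can be sketched 
outside Lucene roughly as follows.  This is not code from Prefetching.patch: 
the class FloorLookupSketch and the methods floorIndex and prefetch are 
hypothetical names, plain Strings stand in for Lucene Terms, and a sorted 
array stands in for the TII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these names exist in Lucene or in the patch.
public class FloorLookupSketch {

    /**
     * Returns the position of the greatest index term that is <= target,
     * or -1 if every index term sorts after target.
     */
    static int floorIndex(String[] sortedIndexTerms, String target) {
        int pos = Arrays.binarySearch(sortedIndexTerms, target);
        if (pos >= 0) {
            return pos;                 // exact hit
        }
        int insertionPoint = -pos - 1;  // first entry greater than target
        return insertionPoint - 1;      // the entry just before it, or -1
    }

    /**
     * Sorts the wanted terms, then walks the index once from front to back.
     * The i-th result is the floor position for the i-th wanted term
     * in sorted order (not in the caller's original order).
     */
    static List<Integer> prefetch(String[] sortedIndexTerms, List<String> wanted) {
        List<String> sorted = new ArrayList<>(wanted);
        Collections.sort(sorted);
        List<Integer> result = new ArrayList<>();
        int i = -1; // last index position known to be <= the current term
        for (String term : sorted) {
            while (i + 1 < sortedIndexTerms.length
                    && sortedIndexTerms[i + 1].compareTo(term) <= 0) {
                i++; // never moves backwards, so the index is read once
            }
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] index = {"abba", "argon", "banana"};
        // "apple" falls between "abba" and "argon", so the floor is "abba".
        System.out.println(index[floorIndex(index, "apple")]);
        System.out.println(prefetch(index, Arrays.asList("cherry", "apple", "aardvark")));
    }
}
```

Starting from the floor entry rather than the next one is what lets a 
forward scan of the main dictionary reach the wanted term without skipping 
past it.  Real Terms order by field and then text, which this sketch ignores.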

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

