[jira] [Commented] (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.

Michael McCandless (JIRA) Thu, 22 Sep 2011 02:57:52 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13112448#comment-13112448
 ]


Michael McCandless commented on LUCENE-2205:
--------------------------------------------


bq. However, due to the growing array, the likelihood of it landing on 2.1 
exactly is probably low. So it would probably error out sometime before 
that.

Actually ArrayUtil.grow is careful about this limit: on that final
grow() it'll go right up to Java's max allowed array size.
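
To illustrate the idea of growth that is clamped at the array limit, here
is a minimal sketch. It is not ArrayUtil's actual code: the class name, the
1/8 over-allocation factor, and the MAX_ARRAY_LENGTH value are assumptions
chosen for the example (the true maximum array size is JVM-dependent).

```java
import java.util.Arrays;

public final class CappedGrow {
    // Assumption: a conservative cap a little below Integer.MAX_VALUE.
    public static final int MAX_ARRAY_LENGTH = Integer.MAX_VALUE - 8;

    /** Returns a capacity >= minSize, over-allocated by about 1/8, never above the cap. */
    public static int oversize(int minSize) {
        if (minSize < 0 || minSize > MAX_ARRAY_LENGTH) {
            throw new IllegalArgumentException("requested size out of range: " + minSize);
        }
        // Use long arithmetic so the over-allocation cannot overflow an int.
        long grown = (long) minSize + (minSize >> 3) + 3;
        // On the final grow, clamp to the cap instead of failing.
        return (int) Math.min(grown, MAX_ARRAY_LENGTH);
    }

    /** Returns the same array if it is already large enough, else a larger copy. */
    public static byte[] grow(byte[] array, int minSize) {
        if (array.length >= minSize) {
            return array;
        }
        return Arrays.copyOf(array, oversize(minSize));
    }
}
```

With this shape, a request near the limit gets exactly the cap rather than
an overflowed or rejected size, so the array only errors out once the data
genuinely no longer fits.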

bq. I'm also building up a 2B terms index (using Test2BTerms), and then I'll 
compare patch/3.x on that index.

OK this finished -- the test passed with the patch (good news!), and
3.x (phew!).

With 3.x, IR.open takes 43.69 seconds and uses 2955 MB of heap.

With the patch, IR.open takes 9.94 seconds (4.4X faster) and uses 505
MB of heap (5.9X less): AWESOME!

The test then does a lookup of a random set of terms.  3.x does this
in 51.2 sec; patch does it in 48.5 sec, good!  (Same set of terms).

bq. I can back port PagedBytes instead if you think it's really needed.

I think we should cut over to PagedBytes.  Today the number of terms we
can support is 2.1B (Java's max array length, since each indexed term is
one array entry) times the index interval (default 128), so ~274.9 B
terms.

But with the current patch, we can roughly estimate bytes per indexed
term:

  * 1 byte for fieldCounter

  * 15 bytes for term UTF8 bytes (non-English content)

  * 1 byte for docFreq (vast majority of terms are < 128 df)

  * 1 byte for skipOffset (vast majority of terms have no skip).

  * 5 bytes for freqOffset

  * 5 bytes for proxOffset

  * 5 bytes for indexOffset

  * 4 bytes for indexToTerms entry

So total ~37 bytes per indexed term, which means ~58.0 M indexed terms
can fit in the 2.1B byte[] limit, or 7.4 B total terms at the default
128 index interval.  This makes me a little nervous... we've already
seen apps that are well over 2.1 B terms.
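
The arithmetic above can be checked with a few lines of Java. This is only
a sketch: the class and method names are made up for illustration, the
per-field byte counts are the rough estimates from the list, and
Integer.MAX_VALUE stands in for the ~2.1B array limit.

```java
public final class IndexedTermEstimate {
    // Rough per-entry estimates taken from the list above.
    public static final int BYTES_PER_INDEXED_TERM =
            1    // fieldCounter
          + 15   // term UTF8 bytes
          + 1    // docFreq
          + 1    // skipOffset
          + 5    // freqOffset
          + 5    // proxOffset
          + 5    // indexOffset
          + 4;   // indexToTerms entry

    /** Indexed terms that fit in a single byte[] of at most maxBytes. */
    public static long maxIndexedTerms(long maxBytes) {
        return maxBytes / BYTES_PER_INDEXED_TERM;
    }

    /** Total terms covered, given one indexed term per indexInterval terms. */
    public static long maxTotalTerms(long maxBytes, int indexInterval) {
        return maxIndexedTerms(maxBytes) * indexInterval;
    }

    public static void main(String[] args) {
        long limit = Integer.MAX_VALUE;
        System.out.println(BYTES_PER_INDEXED_TERM);      // 37
        System.out.println(maxIndexedTerms(limit));      // 58040098   (~58.0 M)
        System.out.println(maxTotalTerms(limit, 128));   // 7429132544 (~7.4 B)
        // For comparison, the 3.x limit of one array entry per indexed term:
        System.out.println(limit * 128);                 // 274877906816 (~274.9 B)
    }
}
```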

Even before the 2.1B limit, it makes me nervous relying on the JRE to
allocate such a large contiguous chunk of RAM.
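
As a rough illustration of why paging helps with both concerns, here is a
minimal sketch of byte storage split across fixed-size blocks and addressed
by a long position. It is not Lucene's PagedBytes API; the class name,
method names, and block-size parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public final class PagedByteStore {
    private final int blockBits;
    private final int blockSize;
    private final int blockMask;
    private final List<byte[]> blocks = new ArrayList<>();
    private long length;

    /** Each block holds 2^blockBits bytes, so no single allocation is huge. */
    public PagedByteStore(int blockBits) {
        this.blockBits = blockBits;
        this.blockSize = 1 << blockBits;
        this.blockMask = blockSize - 1;
    }

    /** Appends one byte, allocating a new block when the current one is full. */
    public void append(byte b) {
        int offset = (int) (length & blockMask);
        if (offset == 0) {
            blocks.add(new byte[blockSize]);
        }
        blocks.get(blocks.size() - 1)[offset] = b;
        length++;
    }

    /** Reads the byte at a long position: high bits pick the block, low bits the offset. */
    public byte get(long pos) {
        if (pos < 0 || pos >= length) {
            throw new IndexOutOfBoundsException("pos=" + pos + " length=" + length);
        }
        return blocks.get((int) (pos >>> blockBits))[(int) (pos & blockMask)];
    }

    public long length() {
        return length;
    }

    public int blockCount() {
        return blocks.size();
    }
}
```

Because positions are longs and each block is allocated separately, the
total can exceed a single array's limit without asking the JRE for one
large contiguous chunk.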

A couple other random things I noticed:

  * When we estimate the initial size of the byte[] (based on .tii
    file size), I think we should divide by indexDivisor?

  * We should conditionally write the skipOffset, only when docFreq is
    >= skipInterval.  Since most terms won't have skip data we can
    save 1 byte for them...
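
The second point could look roughly like the sketch below. It is not the
patch's code: the class and method names are hypothetical, the variable-
length int encoding is a simplified 7-bits-per-byte scheme written out
here for self-containment, and the skipInterval of 16 in the usage note is
just an example value.

```java
import java.io.ByteArrayOutputStream;

public final class TermEntrySketch {
    /** Writes a non-negative int using 7 data bits per byte, low bits first. */
    public static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value);
    }

    /** Encodes docFreq, and skipOffset only when the term can have skip data. */
    public static byte[] encode(int docFreq, int skipOffset, int skipInterval) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, docFreq);
        if (docFreq >= skipInterval) {
            writeVInt(out, skipOffset);
        }
        return out.toByteArray();
    }
}
```

For example, encode(5, 0, 16) yields a single byte, while encode(20, 3, 16)
yields two. The reader has to apply the same docFreq >= skipInterval test
before trying to read a skipOffset, otherwise the two sides fall out of
step.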


> Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and 
> the index pointer long[] and create a more memory efficient data structure.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2205
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>         Environment: Java5
>            Reporter: Aaron McCurry
>            Assignee: Michael McCandless
>             Fix For: 3.5
>
>         Attachments: RandomAccessTest.java, TermInfosReader.java, 
> TermInfosReaderIndex.java, TermInfosReaderIndexDefault.java, 
> TermInfosReaderIndexSmall.java, lowmemory_w_utf8_encoding.patch, 
> patch-final.txt, rawoutput.txt
>
>
> Basically packing those three arrays into a byte array with an int array as 
> an index offset.  
> The performance benefits are staggering on my test index (of size 6.2 GB, with 
> ~1,000,000 documents and ~175,000,000 terms): the memory needed to load the 
> terminfos into memory was reduced to 17% of its original size, from 291.5 
> MB to 49.7 MB.  The random access speed has been made better by 1-2%, load 
> time of the segments is ~40% faster as well, and full GCs on my JVM were 
> made 7 times faster.
> I have already performed the work and am offering this code as a patch.  
> Currently all tests in the trunk pass with this new code enabled.  I did write 
> a system property switch to allow for the original implementation to be used 
> as well.
> -Dorg.apache.lucene.index.TermInfosReader=default or small
> I have also written a blog post about this patch; here is the link:
> http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html
