[jira] [Commented] (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.

Michael McCandless (JIRA) Wed, 21 Sep 2011 12:09:35 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109787#comment-13109787
 ]


Michael McCandless commented on LUCENE-2205:
--------------------------------------------


Patch looks great Aaron!  Very much simplified... some comments:

  * Instead of a separate build method, could we have
    TermInfosReaderIndex's ctor take all the args?  Then we can make
    its private fields final?

  * I think the index and indexLength can be final, in
    TermInfosReader?

  * Can you put the GrowableByteArrayDataOutput as a separate source
    file in oal.store?  Seems useful!

  * Hmm should indexToTermsArray be a long[]...?  I wonder how large
    your index would have to be to overflow the 2.1GB limit of the
    byte[] format...

  * We could further reduce the RAM usage by using packed ints
    (oal.util.packed) for the indexToTerms array; this way each
    indexed term would only use as many bits as are actually required to
    address the byte[] (and, this would solve the int[]/long[] problem
    since packed ints are logically a long[]).

  * I think we should just always trim?  (Ie we don't need the
    {{private boolean trim}})

  * Could you add comment "Just for testing" to
    TermInfosReaderIndex.getTerm?

  * For the compareTo methods, can you add to the jdocs that this
    "compares term to index term", ie it returns negative N when term
    is less than index term?

  * Hmm... I wonder if memory fragmentation will cause problems when
    allocating/growing the single byte[].  Also, a single byte[]
    can "only" address 2.1B bytes (the same overflow problem as
    above).  Maybe we should port back PagedBytes (from trunk
    oal.util) and use that instead?  If we did that, then we could
    create a simple DataInput impl that reads from that.

  * Could you please remove the @author tags?  Thanks. Apache policy
    forbids (or at least discourages) committing author tags...
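
To make the packed ints suggestion above concrete, here is a minimal sketch of the idea: store each offset into the byte[] using only as many bits as the largest offset needs, inside a long[] backing array. PackedOffsets and its methods are hypothetical names invented for illustration; this is not the oal.util.packed API, only an assumption about how such a structure could work.

```java
/** Hypothetical illustration only; NOT Lucene's oal.util.packed API. */
final class PackedOffsets {
    private final long[] blocks;     // values packed back to back, low bits first
    private final int bitsPerValue;  // bits needed for the largest value
    private final int size;

    /** Assumes every value is in the range 0..maxValue. */
    PackedOffsets(long[] values, long maxValue) {
        // Number of bits required to represent maxValue (at least 1).
        this.bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
        this.size = values.length;
        long totalBits = (long) size * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
        for (int i = 0; i < size; i++) {
            set(i, values[i]);
        }
    }

    private void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] |= value << shift;
        // Bits that do not fit in this block spill into the next one.
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            blocks[block + 1] |= value >>> (bitsPerValue - spill);
        }
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            value |= blocks[block + 1] << (bitsPerValue - spill);
        }
        long mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
        return value & mask;
    }

    int size() { return size; }

    int bitsPerValue() { return bitsPerValue; }

    /** Bytes used by the backing array, ignoring object headers. */
    long ramBytesUsed() { return 8L * blocks.length; }
}
```

Because get() returns a long, the caller never sees an int[]/long[] distinction: offsets past 2^31 just cost a few more bits per entry.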


> Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and 
> the index pointer long[] and create a more memory efficient data structure.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2205
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>         Environment: Java5
>            Reporter: Aaron McCurry
>            Assignee: Michael McCandless
>             Fix For: 3.5
>
>         Attachments: RandomAccessTest.java, TermInfosReader.java, 
> TermInfosReaderIndex.java, TermInfosReaderIndexDefault.java, 
> TermInfosReaderIndexSmall.java, lowmemory_w_utf8_encoding.patch, 
> patch-final.txt, rawoutput.txt
>
>
> Basically packing those three arrays into a byte array with an int array as 
> an index offset.  
> The performance benefits are staggering on my test index (of size 6.2 GB, with 
> ~1,000,000 documents and ~175,000,000 terms): the memory needed to load the 
> terminfos into memory was reduced to 17% of its original size, from 291.5 
> MB to 49.7 MB.  The random access speed has been made better by 1-2%, load 
> time of the segments is ~40% faster as well, and full GCs on my JVM were 
> made 7 times faster.
> I have already performed the work and am offering this code as a patch.  
> Currently all tests in the trunk pass with this new code enabled.  I did write 
> a system property switch to allow for the original implementation to be used 
> as well.
> -Dorg.apache.lucene.index.TermInfosReader=default or small
> I have also written a blog post about this patch; here is the link:
> http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html
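
As a rough illustration of the layout the description above refers to (entries packed into one byte[] with an int[] of start offsets), here is a minimal sketch. It is not the code from the patch: PackedTermIndex is a hypothetical name, each entry is simplified to one long pointer followed by the term's UTF-8 bytes, and the terms are assumed to be pre-sorted in unsigned UTF-8 byte order.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; the real patch stores more per-term data than this. */
final class PackedTermIndex {
    private final byte[] data;    // all entries, back to back
    private final int[] offsets;  // offsets[i] = start of entry i; last slot = data.length

    /** Assumes sortedTerms is sorted by unsigned UTF-8 byte order. */
    PackedTermIndex(String[] sortedTerms, long[] pointers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        offsets = new int[sortedTerms.length + 1];
        for (int i = 0; i < sortedTerms.length; i++) {
            offsets[i] = out.size();
            // 8-byte big-endian pointer, then the term text.
            for (int shift = 56; shift >= 0; shift -= 8) {
                out.write((int) (pointers[i] >>> shift));
            }
            byte[] utf8 = sortedTerms[i].getBytes(StandardCharsets.UTF_8);
            out.write(utf8, 0, utf8.length);
        }
        offsets[sortedTerms.length] = out.size();
        data = out.toByteArray();
    }

    int size() { return offsets.length - 1; }

    long pointer(int i) {
        long v = 0;
        for (int k = 0; k < 8; k++) {
            v = (v << 8) | (data[offsets[i] + k] & 0xFFL);
        }
        return v;
    }

    String term(int i) {
        int start = offsets[i] + 8;
        return new String(data, start, offsets[i + 1] - start, StandardCharsets.UTF_8);
    }

    /** Compares key to index term i: negative when key is less than the index term. */
    private int compareTo(byte[] key, int i) {
        int start = offsets[i] + 8;
        int len = offsets[i + 1] - start;
        for (int k = 0; k < Math.min(key.length, len); k++) {
            int diff = (key[k] & 0xFF) - (data[start + k] & 0xFF);
            if (diff != 0) return diff;
        }
        return key.length - len;
    }

    /** Binary search; returns the index, or -(insertionPoint) - 1 if absent. */
    int indexOf(String term) {
        byte[] key = term.getBytes(StandardCharsets.UTF_8);
        int lo = 0, hi = size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareTo(key, mid);
            if (cmp == 0) return mid;
            if (cmp > 0) lo = mid + 1; else hi = mid - 1;
        }
        return -(lo + 1);
    }
}
```

The point of the layout is that the JVM sees two arrays instead of one object per term, which is a plausible reason for the smaller footprint and faster GCs reported above.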

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
