[
https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109780#comment-13109780
]
Aaron McCurry edited comment on LUCENE-2205 at 9/21/11 7:09 PM:
----------------------------------------------------------------
I found a major bug in my test. I was using the keyword analyzer instead of
whitespace or standard, so it was turning every one of my sentences, each
containing 50 randomly generated words, into 1 huge token. This helps to
explain why the heap space results are not that stellar: the fewer terms there
are (and the larger they are), the less the patch helps reduce space.
I'm retesting now.
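
To illustrate the effect described above, here is a minimal sketch in plain Java. It does not use Lucene's analyzer classes; keywordTokens and whitespaceTokens are hypothetical stand-ins that only mimic the two tokenization behaviors (whole input as one token versus splitting on whitespace), and randomSentence is an assumed way of generating the test data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class AnalyzerTokenCount {
    // Stand-in for keyword-style analysis: the whole input becomes one token.
    public static List<String> keywordTokens(String text) {
        List<String> tokens = new ArrayList<>();
        tokens.add(text);
        return tokens;
    }

    // Stand-in for whitespace-style analysis: split on runs of whitespace.
    public static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Builds a sentence of wordCount random lowercase words (assumed test data).
    public static String randomSentence(Random random, int wordCount) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < wordCount; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            int length = 3 + random.nextInt(8);
            for (int j = 0; j < length; j++) {
                sb.append((char) ('a' + random.nextInt(26)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sentence = randomSentence(new Random(42), 50);
        // One term per sentence versus one term per word.
        System.out.println("keyword tokens:    " + keywordTokens(sentence).size());
        System.out.println("whitespace tokens: " + whitespaceTokens(sentence).size());
    }
}
```

With keyword-style analysis each 50-word sentence yields a single long term, so the index holds few, large terms, which is the case where the patch saves the least.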
was (Author: amccurry):
I found a major bug in my test. I was using the keyword analyzer instead of
whitespace or standard, so it was turning every one of my sentences, each
containing 100 randomly generated words, into 1 huge token. This helps to
explain why the heap space results are not that stellar: the fewer terms there
are (and the larger they are), the less the patch helps reduce space.
I'm retesting now.
> Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and
> the index pointer long[] and create a more memory efficient data structure.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-2205
> URL: https://issues.apache.org/jira/browse/LUCENE-2205
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Environment: Java5
> Reporter: Aaron McCurry
> Assignee: Michael McCandless
> Fix For: 3.5
>
> Attachments: RandomAccessTest.java, TermInfosReader.java,
> TermInfosReaderIndex.java, TermInfosReaderIndexDefault.java,
> TermInfosReaderIndexSmall.java, lowmemory_w_utf8_encoding.patch,
> patch-final.txt, rawoutput.txt
>
>
> Basically, this packs those three arrays into a byte array, with an int
> array serving as an index of offsets.
> The performance benefits are staggering on my test index (6.2 GB, with
> ~1,000,000 documents and ~175,000,000 terms): the memory needed to load the
> terminfos was reduced to 17% of its original size, from 291.5 MB to 49.7 MB.
> Random access speed improved by 1-2%, segment load time is ~40% faster as
> well, and full GCs on my JVM became 7 times faster.
> I have already performed the work and am offering this code as a patch.
> Currently all tests in the trunk pass with this new code enabled. I did
> write a system property switch to allow the original implementation to be
> used as well.
> -Dorg.apache.lucene.index.TermInfosReader=default or small
> I have also written a blog post about this patch; here is the link:
> http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html
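
The packing idea in the description above can be sketched roughly as follows. This is not the code from the attached patch: the class name PackedTermIndexSketch, the entry layout, and the simplified term-info fields (docFreq, freqPointer) are assumptions chosen only to show how three parallel arrays might collapse into one byte[] plus an int[] of offsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class PackedTermIndexSketch {
    // Assumed entry layout (not taken from the patch):
    // [int termLength][UTF-8 term bytes][int docFreq][long freqPointer][long indexPointer]
    private final byte[] data;      // all entries stored back to back
    private final int[] offsets;    // offsets[i] = start of entry i within data
    private final ByteBuffer view;  // big-endian view over data for absolute reads

    public PackedTermIndexSketch(String[] terms, int[] docFreqs,
                                 long[] freqPointers, long[] indexPointers) {
        int n = terms.length;
        byte[][] utf8 = new byte[n][];
        int total = 0;
        // First pass: encode the terms and compute the total packed size.
        for (int i = 0; i < n; i++) {
            utf8[i] = terms[i].getBytes(StandardCharsets.UTF_8);
            total += 4 + utf8[i].length + 4 + 8 + 8;
        }
        data = new byte[total];
        offsets = new int[n];
        ByteBuffer out = ByteBuffer.wrap(data);
        // Second pass: write each entry and remember where it starts.
        for (int i = 0; i < n; i++) {
            offsets[i] = out.position();
            out.putInt(utf8[i].length).put(utf8[i]);
            out.putInt(docFreqs[i]).putLong(freqPointers[i]).putLong(indexPointers[i]);
        }
        view = ByteBuffer.wrap(data);
    }

    public int size() {
        return offsets.length;
    }

    public String term(int i) {
        int start = offsets[i];
        int length = view.getInt(start);
        return new String(data, start + 4, length, StandardCharsets.UTF_8);
    }

    public int docFreq(int i) {
        int start = offsets[i];
        return view.getInt(start + 4 + view.getInt(start));
    }

    public long freqPointer(int i) {
        int start = offsets[i];
        return view.getLong(start + 4 + view.getInt(start) + 4);
    }

    public long indexPointer(int i) {
        int start = offsets[i];
        return view.getLong(start + 4 + view.getInt(start) + 4 + 8);
    }

    public static void main(String[] args) {
        PackedTermIndexSketch index = new PackedTermIndexSketch(
                new String[] {"apple", "zebra"},
                new int[] {3, 7},
                new long[] {10L, 20L},
                new long[] {100L, 200L});
        System.out.println(index.term(1) + " docFreq=" + index.docFreq(1)
                + " indexPointer=" + index.indexPointer(1));
    }
}
```

The saving in a layout like this comes from replacing one object per term and per term-info with two primitive arrays, which also gives the garbage collector far fewer references to trace.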
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira