[ https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13120504#comment-13120504 ]
Michael McCandless commented on LUCENE-2205:
--------------------------------------------

Looking great Aaron!

It's spooky that PagedBytesDataInput calls fillSlice for every .readByte -- can't we have it hold the current block and then only switch to a new block in .readByte() if it's at the end of the current block? Same for PagedBytesDataOutput? We can have these DataInput/Output impls be private to PagedBytes (so they can access the pages directly)?

You should be able to use PackedInts.GrowableWriter to append the ints directly, instead of first writing to the indexToTermsArray and then separately to the packed ints? That saves the added transient RAM usage and the 2nd pass.

I don't think you need to write the indexToTerms packed ints into a PagedBytesDataOutput (if you use GrowableWriter it just uses a byte[] under the hood, and resizes as needed)? This array will be small enough, since it's the packed int byte address of every 128th term, I think (but dataOutput does need to be paged bytes).

> Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and
> the index pointer long[] and create a more memory efficient data structure.
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-2205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2205
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>         Environment: Java5
>            Reporter: Aaron McCurry
>            Assignee: Michael McCandless
>             Fix For: 3.5
>
>         Attachments: RandomAccessTest.java, TermInfosReader.java,
> TermInfosReaderIndex.java, TermInfosReaderIndexDefault.java,
> TermInfosReaderIndexSmall.java, lowmemory_w_utf8_encoding.patch,
> lowmemory_w_utf8_encoding.v4.patch, patch-final.txt, rawoutput.txt
>
>
> Basically packing those three arrays into a byte array with an int array as
> an index offset.
> The performance benefits are staggering on my test index (of size 6.2 GB, with
> ~1,000,000 documents and ~175,000,000 terms): the memory needed to load the
> terminfos was reduced to 17% of its original size, from 291.5
> MB to 49.7 MB. Random access speed improved by 1-2%, segment load
> time is ~40% faster as well, and full GCs on my JVM were
> made 7 times faster.
> I have already performed the work and am offering this code as a patch.
> Currently all tests in the trunk pass with this new code enabled. I did write
> a system property switch to allow for the original implementation to be used
> as well.
> -Dorg.apache.lucene.index.TermInfosReader=default or small
> I have also written a blog post about this patch; here is the link.
> http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html
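
For illustration only, here is a minimal sketch of the suggestion in the comment above: a reader that holds the current block and switches blocks only when it reaches the end of that block, instead of doing a slice lookup on every readByte(). PagedBytesSketch, Reader, and all field names are hypothetical stand-ins, not the actual Lucene PagedBytes API or the code in the attached patch. The reader is an inner class so it can touch the pages directly, which is the point of keeping such impls private to the paged structure.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a paged byte store; NOT the Lucene PagedBytes class. */
public class PagedBytesSketch {
    private final int blockSize;
    private final List<byte[]> blocks = new ArrayList<byte[]>();
    private byte[] writeBlock;   // block currently being filled
    private int writeUpto;       // next free position in writeBlock
    private long length;         // total bytes written so far

    public PagedBytesSketch(int blockSize) {
        this.blockSize = blockSize;
        this.writeUpto = blockSize; // forces allocation of a block on the first write
    }

    /** Appends one byte, allocating a new block only when the current one is full. */
    public void writeByte(byte b) {
        if (writeUpto == blockSize) {
            writeBlock = new byte[blockSize];
            blocks.add(writeBlock);
            writeUpto = 0;
        }
        writeBlock[writeUpto++] = b;
        length++;
    }

    public long length() {
        return length;
    }

    /** Returns a sequential reader over the bytes written so far. */
    public Reader reader() {
        return new Reader();
    }

    /** Inner class so it can access the pages directly. */
    public final class Reader {
        private int blockIndex = -1;
        private byte[] current;           // block currently being read
        private int upto = blockSize;     // forces loading a block on the first read
        private long remaining = length;  // bytes left, fixed at reader creation

        public byte readByte() {
            if (remaining == 0) {
                throw new IllegalStateException("read past end");
            }
            if (upto == blockSize) {
                // Only here, at a block boundary, do we look up a new block.
                current = blocks.get(++blockIndex);
                upto = 0;
            }
            remaining--;
            return current[upto++];
        }
    }
}
```

With a block size of 4 and 10 bytes written, the reader performs three block lookups in total rather than ten; the common path in readByte() is a bounds check and an array read.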