[
https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105624#comment-13105624
]
Michael McCandless commented on LUCENE-2205:
--------------------------------------------
OK, thinking more here... the fact that this won't change the index
format, and only replaces the low-level representation & methods for
how indexed terms are held in RAM and accessed, means that the risk
here is actually quite low.
And the gains are tremendous (much lower RAM usage; much less GC load;
faster IR init time). Users shouldn't have to wait for 4.0 to get
these improvements.
Seek cost will go up, but this likely doesn't often matter (3.x
doesn't have any super-seek-intensive queries). Maybe the primary-key
lookup case is the worst-case, so we should measure that?
I think we can backport some helpers from trunk to support this, e.g.
ByteArrayDataInput (to get readVInt/readVLong/etc. on a byte[]).
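To make the readVInt/readVLong idea concrete, here is a minimal sketch of
decoding Lucene-style variable-length integers straight off a byte[] (7
payload bits per byte, low-order bits first, high bit set when more bytes
follow). The class name ByteArrayReaderSketch is made up for illustration;
it is not trunk's ByteArrayDataInput, just an approximation of what such a
helper does.

```java
/** Hypothetical stand-in for a byte[]-backed reader; NOT the actual Lucene class. */
final class ByteArrayReaderSketch {
    private final byte[] bytes;
    private int pos; // next byte to read

    ByteArrayReaderSketch(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Reads a variable-length int: 7 bits per byte, high bit means "more bytes follow". */
    int readVInt() {
        byte b = bytes[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Same encoding as readVInt, but accumulates into a long. */
    long readVLong() {
        byte b = bytes[pos++];
        long value = b & 0x7FL;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];
            value |= (b & 0x7FL) << shift;
        }
        return value;
    }

    int getPosition() {
        return pos;
    }
}
```

Because the reader only keeps a position into an existing array, no
per-entry objects are allocated while decoding.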
I don't think we need to make this switchable with a system prop;
let's just do a hard cutover to the new impl?
Instead of writing the term data as vLong, can we just write the UTF8
bytes? On seek we can convert incoming term's text to UTF8, and then
use trunk's UTF8SortedAsUTF16Comparator to do the compares in the
binary search (so we keep 3.x's UTF16 term sort order).
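As a rough sketch of what comparing UTF8 bytes in UTF16 order involves:
plain unsigned byte comparison differs from UTF16 order only when the first
differing bytes are both lead bytes >= 0xEE, because U+E000..U+FFFF (lead
bytes 0xEE/0xEF) must sort after supplementary characters (lead bytes
0xF0..0xF4, i.e. surrogate pairs in UTF16). The class and method names
below are made up, and StandardCharsets needs Java 7, so treat this as an
illustration of the technique rather than the trunk comparator itself.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: compares UTF-8 encoded terms in UTF-16 code unit order. */
final class Utf8AsUtf16OrderSketch {

    /** Result has the same sign as String.compareTo on the decoded text (valid UTF-8 assumed). */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                // Only lead bytes 0xEE and above can disagree with UTF-16 order.
                if (x >= 0xEE && y >= 0xEE) {
                    // Move 0xEE/0xEF (U+E000..U+FFFF) above the 4-byte lead bytes.
                    if ((x & 0xFE) == 0xEE) x += 0x0E;
                    if ((y & 0xFE) == 0xEE) y += 0x0E;
                }
                return x - y;
            }
        }
        // One term is a prefix of the other: the shorter one sorts first.
        return a.length - b.length;
    }

    /** Helper for the example; assumes Java 7+ for StandardCharsets. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, "\uE000" sorts before U+10000 in raw UTF8 byte order (0xEE <
0xF0) but after it in UTF16 order, and the fix-up above restores the UTF16
result.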
We should also remember, on commit, to merge this to 4.0's preflex codec...
Aaron, on future iterations, could you use "svn diff" to produce a
single patch file (instead of separate files as attachments)? This
way I (and others) can easily apply it to a local checkout for testing...
> Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and
> the index pointer long[] and create a more memory efficient data structure.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-2205
> URL: https://issues.apache.org/jira/browse/LUCENE-2205
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Environment: Java5
> Reporter: Aaron McCurry
> Fix For: 3.5
>
> Attachments: RandomAccessTest.java, TermInfosReader.java,
> TermInfosReaderIndex.java, TermInfosReaderIndexDefault.java,
> TermInfosReaderIndexSmall.java, patch-final.txt, rawoutput.txt
>
>
> Basically, this packs those three arrays into a byte array, with an int array as
> an offset index.
> The performance benefits are staggering. On my test index (6.2 GB, with
> ~1,000,000 documents and ~175,000,000 terms), the memory needed to load the
> terminfos was reduced to 17% of its original size, from 291.5
> MB to 49.7 MB. Random access is 1-2% faster, segment load
> time is ~40% faster, and full GCs on my JVM are
> 7 times faster.
> I have already performed the work and am offering this code as a patch.
> Currently all tests in the trunk pass with this new code enabled. I did write
> a system property switch to allow for the original implementation to be used
> as well.
> -Dorg.apache.lucene.index.TermInfosReader=default or small
> I have also written a blog post about this patch; here is the link:
> http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html
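For readers who want a picture of the layout the description above talks
about, here is a minimal sketch of packing per-term data into one byte[]
with an int[] of entry offsets. The entry layout (vLong term length, UTF8
term bytes, vLong pointer), the class name, and the method names are all
assumptions made for illustration; the attached patch may well lay things
out differently. Terms are assumed to be pre-sorted by String.compareTo.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Illustrative packed index: one byte[] for all entries plus an int[] of entry offsets. */
final class PackedTermIndexSketch {
    private final byte[] data;   // entries stored back to back
    private final int[] offsets; // offsets[i] = start of entry i in data

    /** sortedTerms must already be in String.compareTo order (an assumption of this sketch). */
    PackedTermIndexSketch(String[] sortedTerms, long[] pointers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        offsets = new int[sortedTerms.length];
        for (int i = 0; i < sortedTerms.length; i++) {
            offsets[i] = out.size();
            byte[] utf8 = sortedTerms[i].getBytes(StandardCharsets.UTF_8);
            writeVLong(out, utf8.length);    // term length in bytes
            out.write(utf8, 0, utf8.length); // term text as UTF-8
            writeVLong(out, pointers[i]);    // hypothetical file pointer for this term
        }
        data = out.toByteArray();
    }

    private static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    /** Reads a vLong at cursor[0] and advances the cursor past it. */
    private static long readVLong(byte[] buf, int[] cursor) {
        byte b = buf[cursor[0]++];
        long value = b & 0x7FL;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[cursor[0]++];
            value |= (b & 0x7FL) << shift;
        }
        return value;
    }

    String term(int i) {
        int[] cursor = {offsets[i]};
        int len = (int) readVLong(data, cursor);
        return new String(data, cursor[0], len, StandardCharsets.UTF_8);
    }

    long pointer(int i) {
        int[] cursor = {offsets[i]};
        int len = (int) readVLong(data, cursor);
        cursor[0] += len; // skip the term bytes
        return readVLong(data, cursor);
    }

    /** Binary search: index of the greatest term <= target, or -1 if every term is greater. */
    int floorIndex(String target) {
        int lo = 0, hi = offsets.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = term(mid).compareTo(target);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return hi;
    }
}
```

This sketch decodes each probed term into a String during the search to
keep it short; comparing the raw bytes directly would avoid that
allocation.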
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira