[ 
https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109413#comment-13109413
 ] 

Aaron McCurry commented on LUCENE-2205:
---------------------------------------

I have reimplemented the patch using the UTF8SortedAsUTF16Comparator as well as 
ByteArrayDataInput.  The patch also contains a unit test and I have run all the 
current tests of the core plus the contribs and everything passes.  As a plus 
the code has gotten much simpler.

During my functional testing I created a test index with small but very diverse 
terms.  It holds 50 million documents with roughly 50 terms each, so there are 
approximately 2.5 billion terms in this index.

The current 3x branch produces:
50000000 documents at a heap size of 598902872.

The patched version produces:
50000000 documents at a heap size of 282526224.

The patch also wins on random access performance for this index.  Running 200 
passes of a collection of randomly sampled queries (the queries change on each 
pass) produces the following:

The current 3x branch produces:
4186.0225 avg response time in ms

The patched version produces:
2930.1371 avg response time in ms

NOTE: The hard drive I was using is a very slow drive.  With smaller indexes 
the patch and the current branch perform almost identically; depending on the 
pass, either one was faster.
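
For reference, the relative improvements implied by the figures reported above 
can be checked with a few lines of Java.  This is only a sketch of the 
arithmetic; the class and method names are made up for illustration, and the 
heap sizes are used exactly as reported, without assuming a unit.  It works out 
to roughly a 53% smaller heap and a 30% lower average response time.

```java
// Hypothetical helper, not part of the patch: computes relative reductions
// from the numbers quoted in this comment.
final class ReductionSketch {

    // Percentage by which `after` is smaller than `before`.
    static double percentReduction(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Heap sizes as reported for the 3x branch and the patched version.
        System.out.printf("heap reduction: %.1f%%%n",
                percentReduction(598902872L, 282526224L));
        // Average response times in ms over the 200 passes.
        System.out.printf("response time reduction: %.1f%%%n",
                percentReduction(4186.0225, 2930.1371));
    }
}
```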


> Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and 
> the index pointer long[] and create a more memory efficient data structure.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2205
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/index
>         Environment: Java5
>            Reporter: Aaron McCurry
>            Assignee: Michael McCandless
>             Fix For: 3.5
>
>         Attachments: RandomAccessTest.java, TermInfosReader.java, 
> TermInfosReaderIndex.java, TermInfosReaderIndexDefault.java, 
> TermInfosReaderIndexSmall.java, patch-final.txt, rawoutput.txt
>
>
> Basically packing those three arrays into a byte array with an int array as 
> an index offset.  
> The performance benefits are staggering on my test index (of size 6.2 GB, with 
> ~1,000,000 documents and ~175,000,000 terms): the memory needed to load the 
> terminfos into memory was reduced to 17% of its original size, from 291.5 
> MB to 49.7 MB.  The random access speed has improved by 1-2%, load 
> time of the segments is ~40% faster as well, and full GCs on my JVM were 
> made 7 times faster.
> I have already performed the work and am offering this code as a patch.  
> Currently all tests in the trunk pass with this new code enabled.  I did write 
> a system property switch to allow for the original implementation to be used 
> as well.
> -Dorg.apache.lucene.index.TermInfosReader=default or small
> I have also written a blog post about this patch; here is the link.
> http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html
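
To make the idea in the description concrete (packing the per-term arrays into 
one byte array with an int array of offsets), here is a minimal sketch.  It is 
not the code from the patch: the class name, the entry layout (an 8-byte 
pointer followed by the term's UTF-8 bytes) and the method names are all 
assumptions made for illustration.  It also compares terms in plain unsigned 
byte order, which is a simplification; the patch is described as using 
UTF8SortedAsUTF16Comparator, and the two orders can differ for characters 
outside the Basic Multilingual Plane.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: one byte[] plus one int[] replaces per-term objects.
final class PackedTermIndexSketch {
    private final byte[] data;   // all entries packed back to back
    private final int[] offsets; // offsets[i] = start of entry i; last = data length

    // Assumes sortedTerms is already sorted by unsigned UTF-8 byte order.
    PackedTermIndexSketch(String[] sortedTerms, long[] pointers) {
        if (sortedTerms.length != pointers.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        this.offsets = new int[sortedTerms.length + 1];
        for (int i = 0; i < sortedTerms.length; i++) {
            offsets[i] = out.size();
            // Entry layout (assumed): 8-byte big-endian pointer, then term bytes.
            for (int shift = 56; shift >= 0; shift -= 8) {
                out.write((int) (pointers[i] >>> shift));
            }
            byte[] utf8 = sortedTerms[i].getBytes(StandardCharsets.UTF_8);
            out.write(utf8, 0, utf8.length);
        }
        offsets[sortedTerms.length] = out.size();
        this.data = out.toByteArray();
    }

    // Decodes the pointer stored at the start of entry i.
    long pointerAt(int i) {
        long p = 0;
        for (int k = 0; k < 8; k++) {
            p = (p << 8) | (data[offsets[i] + k] & 0xFF);
        }
        return p;
    }

    // Compares the term bytes of entry i with key, without allocating.
    private int compareTermAt(int i, byte[] key) {
        int start = offsets[i] + 8;
        int len = offsets[i + 1] - start;
        int n = Math.min(len, key.length);
        for (int k = 0; k < n; k++) {
            int diff = (data[start + k] & 0xFF) - (key[k] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return len - key.length;
    }

    // Binary search; returns the entry index, or -(insertionPoint + 1) if absent.
    int indexOf(String term) {
        byte[] key = term.getBytes(StandardCharsets.UTF_8);
        int lo = 0;
        int hi = offsets.length - 2;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareTermAt(mid, key);
            if (cmp < 0) {
                lo = mid + 1;
            } else if (cmp > 0) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -(lo + 1);
    }
}
```

With this layout the JVM holds two arrays rather than one object per term, 
which is the general reason such packing tends to lower heap usage and give 
the garbage collector less to trace.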

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
