[jira] Commented: (LUCENE-2205) Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and the index pointer long[] and create a more memory efficient data structure.

Michael McCandless (JIRA) Wed, 13 Jan 2010 11:58:25 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799906#action_12799906
 ]


Michael McCandless commented on LUCENE-2205:
--------------------------------------------

Another benefit of doing this with flex is that you can also change the index 
file format, i.e. write the vints to disk (so the "build" is done at index 
time, not at reader startup time), which would make init time even faster.

Hmm... it's surprising you're seeing faster decode time -- it looks like you 
read a vint per character of each index term compared during the binary 
search, whereas trunk just calls String.compareTo?  (Though, if those 
characters are simple ASCII, then each vint is always a single-byte read.)
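
To illustrate the single-byte point, here is a minimal sketch of a vint codec 
in the style Lucene uses (7 data bits per byte, high bit set when more bytes 
follow, low-order bits first).  The class and method names are made up for 
this example and are not taken from the patch:

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, for illustration only; not code from the patch.
final class VIntSketch {

    // Encode a non-negative int as a vint: low 7 bits first,
    // high bit of each byte set when another byte follows.
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
        return out.toByteArray();
    }

    // Decode the vint that starts at buf[pos].
    static int readVInt(byte[] buf, int pos) {
        byte b = buf[pos++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = buf[pos++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }
}
```

Any char below 0x80 fits in one byte, so for ASCII terms the decode loop body 
never runs; chars from 0x80 up to 0x3FFF take two bytes, and the rest of the 
char range takes three.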

Actually, couldn't you simply compare the UTF-8 bytes (plus a "fixup", to match 
UTF-16 sort order), which would require no per-character vint decode?  (flex 
does this, since it holds the term data as UTF-8 bytes in memory.)
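
A rough sketch of what such a comparison could look like, assuming both terms 
are already held as valid UTF-8 byte arrays.  The class and method names are 
hypothetical, and this is not the actual flex code.  The only place raw byte 
order and UTF-16 order disagree is U+E000..U+FFFF (lead bytes 0xEE/0xEF) 
versus supplementary characters (lead bytes 0xF0..0xF4), so the fixup moves 
0xEE/0xEF above the supplementary lead bytes at the first differing byte:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not the flex implementation.
final class Utf8AsUtf16Order {

    // Compare two UTF-8 encoded terms so that the result has the same
    // sign as String.compareTo on the corresponding Java strings.
    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int aByte = a[i] & 0xFF;
            int bByte = b[i] & 0xFF;
            if (aByte != bByte) {
                // Fixup: in UTF-16, surrogates (supplementary chars) sort
                // below U+E000..U+FFFF, so shift lead bytes 0xEE/0xEF up to
                // the unused values 0xFC/0xFD before comparing.
                if (aByte >= 0xEE && bByte >= 0xEE) {
                    if ((aByte & 0xFE) == 0xEE) aByte += 0x0E;
                    if ((bByte & 0xFE) == 0xEE) bByte += 0x0E;
                }
                return aByte - bByte;
            }
        }
        // One term is a prefix of the other: the shorter one sorts first.
        return a.length - b.length;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

For example, "\uE000" sorts after "\uD800\uDC00" (U+10000) under 
String.compareTo, but before it in plain byte order; with the fixup the byte 
comparison agrees with String.compareTo, and no per-character decode is 
needed.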

> Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and 
> the index pointer long[] and create a more memory efficient data structure.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2205
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2205
>             Project: Lucene - Java
>          Issue Type: Improvement
>         Environment: Java5
>            Reporter: Aaron McCurry
>         Attachments: patch-final.txt, RandomAccessTest.java, rawoutput.txt
>
>
> Basically, this packs those three arrays into a byte array, with an int 
> array of offsets as the index.  
> The performance benefits are staggering. On my test index (6.2 GB, with 
> ~1,000,000 documents and ~175,000,000 terms), the memory needed to load the 
> terminfos was reduced to 17% of its original size, from 291.5 MB to 49.7 MB.  
> Random access is 1-2% faster, segment load time is ~40% faster, and full GCs 
> on my JVM are 7 times faster.
> I have already done the work and am offering this code as a patch.  
> Currently all tests in the trunk pass with this new code enabled.  I also 
> wrote a system property switch that allows the original implementation to be 
> used as well:
> -Dorg.apache.lucene.index.TermInfosReader=default or small
> I have also written a blog post about this patch; here is the link:
> http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


