[
https://issues.apache.org/jira/browse/LUCENE-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13109787#comment-13109787
]
Michael McCandless commented on LUCENE-2205:
--------------------------------------------
Patch looks great Aaron! Very much simplified... some comments:
* Instead of separate build method, could we have
TermInfosReaderIndex's ctor take all the args? Then we can make
its private fields final?
* I think the index and indexLength can be final, in
TermInfosReader?
* Can you put the GrowableByteArrayDataOutput as a separate source
file in oal.store? Seems useful!
* Hmm should indexToTermsArray be a long[]...? I wonder how large
your index would have to be to overflow 2.1GB of the byte[]
format...
* We could further reduce the RAM usage by using packed ints
(oal.util.packed) for the indexToTerms array; this way each
indexed term would only use as many bits as are actually required to
address the byte[] (and this would solve the int[]/long[] problem,
since packed ints are logically a long[]).
* I think we should just always trim? (Ie we don't need the
{{private boolean trim}})
* Could you add comment "Just for testing" to
TermInfosReaderIndex.getTerm?
* For the compareTo methods, can you add to the jdocs that this
"compares term to index term", ie it returns negative N when term
is less than index term?
* Hmm... I wonder if memory fragmentation will cause problems when
allocating/growing the single byte[]. Also, a single byte[]
can "only" address 2.1B bytes (the same overflow problem as
above). Maybe we should port back PagedBytes (from trunk
oal.util) and use that instead? If we did that, then we could
create a simple DataInput impl that reads from that.
* Could you please remove the @author tags? Thanks. Apache's
policy is to not commit author tags (or at least it's discouraged)...
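To make the packed-ints suggestion above concrete, here is a minimal sketch of the idea in plain Java. It is not the oal.util.packed implementation; the class name PackedOffsets and its methods are hypothetical. Each offset is stored in only as many bits as the largest offset needs, inside a long[], so offsets beyond Integer.MAX_VALUE also fit without switching to a full long per entry.

```java
/** Hypothetical illustration only; NOT Lucene's oal.util.packed code. */
final class PackedOffsets {
  private final long[] blocks;    // backing storage, 64 bits per block
  private final int bitsPerValue; // bits needed for the largest value
  private final long mask;        // low bitsPerValue bits set

  PackedOffsets(int valueCount, long maxValue) {
    // Number of bits required to represent maxValue (at least 1).
    this.bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    this.mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
    long totalBits = (long) valueCount * bitsPerValue;
    this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
  }

  void set(int index, long value) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    blocks[block] = (blocks[block] & ~(mask << shift)) | ((value & mask) << shift);
    int spill = shift + bitsPerValue - 64;
    if (spill > 0) {
      // The value straddles two blocks: write its high bits into the next one.
      long highMask = (1L << spill) - 1;
      blocks[block + 1] = (blocks[block + 1] & ~highMask)
          | ((value & mask) >>> (bitsPerValue - spill));
    }
  }

  long get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long value = blocks[block] >>> shift;
    int spill = shift + bitsPerValue - 64;
    if (spill > 0) {
      // Pull the remaining high bits from the next block.
      value |= blocks[block + 1] << (bitsPerValue - spill);
    }
    return value & mask;
  }

  int bitsPerValue() {
    return bitsPerValue;
  }
}
```

With a maximum offset of 1000, for example, each entry takes 10 bits instead of the 32 an int[] would use.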
> Rework of the TermInfosReader class to remove the Terms[], TermInfos[], and
> the index pointer long[] and create a more memory efficient data structure.
> -------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-2205
> URL: https://issues.apache.org/jira/browse/LUCENE-2205
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/index
> Environment: Java5
> Reporter: Aaron McCurry
> Assignee: Michael McCandless
> Fix For: 3.5
>
> Attachments: RandomAccessTest.java, TermInfosReader.java,
> TermInfosReaderIndex.java, TermInfosReaderIndexDefault.java,
> TermInfosReaderIndexSmall.java, lowmemory_w_utf8_encoding.patch,
> patch-final.txt, rawoutput.txt
>
>
> Basically packing those three arrays into a byte array with an int array as
> an index offset.
> The performance benefits are staggering on my test index (of size 6.2 GB, with
> ~1,000,000 documents and ~175,000,000 terms): the memory needed to load the
> terminfos into memory was reduced to 17% of its original size, from 291.5
> MB to 49.7 MB. The random access speed has been made better by 1-2%, load
> time of the segments is ~40% faster as well, and full GCs on my JVM were
> made 7 times faster.
> I have already performed the work and am offering this code as a patch.
> Currently all tests in the trunk pass with this new code enabled. I did write
> a system property switch to allow for the original implementation to be used
> as well.
> -Dorg.apache.lucene.index.TermInfosReader=default or small
> I have also written a blog post about this patch; here is the link:
> http://www.nearinfinity.com/blogs/aaron_mccurry/my_first_lucene_patch.html
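For readers who have not opened the patch, the layout described in the issue (terms packed into one byte[] with an int[] of offsets into it) can be sketched roughly as follows. This is an illustration under assumptions, not the patch's code: the class PackedTermIndex and its methods are hypothetical, only term text is stored (no TermInfo or index pointers), and the input is assumed to be already sorted in UTF-8 byte order.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not the code from the patch. */
final class PackedTermIndex {
  private final byte[] data;   // all terms, UTF-8 encoded, back to back
  private final int[] offsets; // offsets[i] = start of term i; last entry = data.length

  /** Assumes sortedTerms is sorted in UTF-8 byte order. */
  PackedTermIndex(String[] sortedTerms) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    offsets = new int[sortedTerms.length + 1];
    for (int i = 0; i < sortedTerms.length; i++) {
      offsets[i] = out.size();
      byte[] utf8 = sortedTerms[i].getBytes(StandardCharsets.UTF_8);
      out.write(utf8, 0, utf8.length);
    }
    offsets[sortedTerms.length] = out.size();
    data = out.toByteArray();
  }

  /** Compares term to index term: negative when term is less than the index term. */
  int compareTo(byte[] term, int index) {
    int start = offsets[index];
    int length = offsets[index + 1] - start;
    int common = Math.min(term.length, length);
    for (int i = 0; i < common; i++) {
      // Compare as unsigned bytes so the order matches UTF-8 byte order.
      int diff = (term[i] & 0xFF) - (data[start + i] & 0xFF);
      if (diff != 0) {
        return diff;
      }
    }
    return term.length - length;
  }

  /** Returns the position of the greatest index term <= term, or -1 if none. */
  int getIndexOffset(String term) {
    byte[] utf8 = term.getBytes(StandardCharsets.UTF_8);
    int lo = 0;
    int hi = offsets.length - 2;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int cmp = compareTo(utf8, mid);
      if (cmp < 0) {
        hi = mid - 1;
      } else if (cmp > 0) {
        lo = mid + 1;
      } else {
        return mid;
      }
    }
    return hi;
  }
}
```

Because every term lives in one array, the garbage collector sees two objects instead of one object per term, which is consistent with the GC improvement reported in the issue.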