[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848626#action_12848626 ]
Michael McCandless commented on LUCENE-2329:
--------------------------------------------

Sweet, this looks great Michael! Less RAM used and faster indexing (much less GC load) -- win/win.

It's a little surprising that the segment count dropped from 41 -> 32? Ie, how much less RAM do the parallel arrays take? They save the object header per unique term, and 4 bytes on 64-bit JREs, since the "pointer" is now an int and not a real pointer. But other things use RAM (the docIDs in the postings themselves, norms, etc.), so it's surprising the savings was large enough to give 22% fewer segments... are you sure there isn't a bug in the RAM usage accounting?

> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>
>                 Key: LUCENE-2329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2329
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in
> TermsHashPerField we want to switch to parallel arrays. The termsHash will
> simply be an int[] which maps each term to a dense termID.
> All data that the PostingList classes currently hold will then be placed in
> parallel arrays, where the termID is the index into the arrays. This will
> avoid the need for object pooling and will remove the overhead of object
> initialization and garbage collection. Garbage collection especially should
> benefit significantly when the JVM runs low on memory, because in such a
> situation the GC mark times can get very long if there is a large number of
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors. We could avoid
> the need to store the term string per document in the TermVector.
> Instead we could just store the segment-wide termIDs. This would reduce the
> size and also make it easier to implement efficient algorithms that use
> TermVectors, because no term mapping across documents in a segment would be
> necessary. That improvement, though, can be made in a separate JIRA issue.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
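To illustrate the layout the description proposes, here is a minimal sketch of per-term data held in parallel int arrays indexed by a dense termID, in place of one long-living PostingList object per unique term. The class and field names are illustrative assumptions, not the actual names in the attached patch:

```java
// Sketch of the parallel-array idea from LUCENE-2329 (names are hypothetical):
// instead of allocating one PostingList object per unique term, each per-term
// field becomes a slot in an int[] indexed by the term's dense termID.
class ParallelPostingsArray {
    final int[] textStarts; // offset of each term's text in a shared byte pool
    final int[] freqs;      // occurrence count for each term
    final int[] lastDocIDs; // last docID in which each term was seen

    ParallelPostingsArray(int size) {
        textStarts = new int[size];
        freqs = new int[size];
        lastDocIDs = new int[size];
    }

    // Record one occurrence of the term identified by termID in docID.
    // No object allocation happens here -- just array writes.
    void addOccurrence(int termID, int docID) {
        freqs[termID]++;
        lastDocIDs[termID] = docID;
    }
}
```

Because the arrays are plain int[]s, the GC sees three objects regardless of how many unique terms exist, which is what makes mark times short even under memory pressure; each term also drops the per-object header and the 64-bit pointer becomes a 4-byte termID.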