[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Busch updated LUCENE-2329:
----------------------------------

    Attachment: lucene-2329-2.patch

This patch:
* Changes DocumentsWriter to trigger the flush using bytesAllocated instead of bytesUsed, to improve the "running hot" issue Mike's seeing
* Improves the ParallelPostingsArray to grow using ArrayUtil.oversize()

In IRC we discussed changing TermsHashPerField to shrink the parallel arrays in freeRAM(), but that involves tricky thread-safety changes, because one thread could call DocumentsWriter.balanceRAM(), which triggers freeRAM() across *all* thread states, while other threads keep indexing.

We decided to leave it the way it currently works: we discard the whole parallel array during flush and don't reuse it. This is not as optimal as it could be, but once LUCENE-2324 is done this won't be an issue anymore anyway.

Note that this new patch is against the flex branch: I thought we'd switch it over soon anyway? I can also create a patch for trunk if that's preferred.

> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>
>                 Key: LUCENE-2329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2329
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: lucene-2329-2.patch, lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in
> TermsHashPerField we want to switch to parallel arrays. The termsHash will
> simply be an int[] which maps each term to a dense termID.
> All data that the PostingList classes currently hold will then be placed in
> parallel arrays, where the termID is the index into the arrays. This will
> avoid the need for object pooling and will remove the overhead of object
> initialization and garbage collection.
> Garbage collection in particular should benefit significantly when the JVM
> runs out of memory, because in such a situation the gc mark times can get
> very long if there is a large number of long-living objects in memory.
> Another benefit could be to build more efficient TermVectors. We could avoid
> the need of having to store the term string per document in the TermVector.
> Instead we could just store the segment-wide termIDs. This would reduce the
> size and also make it easier to implement efficient algorithms that use
> TermVectors, because no term mapping across documents in a segment would be
> necessary. This improvement, though, can be made in a separate JIRA issue.
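
For readers following along, the parallel-array layout described in the issue can be sketched roughly as below. This is a minimal illustration and not the code in the attached patch: the class name ParallelPostingsSketch, the fields occurrences and lastDocIds, and the growth formula are all hypothetical, and a HashMap stands in for the int[] term hash purely to keep the example short.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of per-term data held in parallel arrays.
 * Not Lucene's ParallelPostingsArray; names and fields are assumptions.
 */
final class ParallelPostingsSketch {
    // Stand-in for the term hash: maps each term to a dense termID (0, 1, 2, ...).
    private final Map<String, Integer> termIds = new HashMap<>();

    // Parallel arrays: slot i of every array belongs to termID i,
    // so no per-term object is ever allocated.
    private int[] occurrences;
    private int[] lastDocIds;
    private int numTerms;

    ParallelPostingsSketch(int initialCapacity) {
        int size = Math.max(1, initialCapacity);
        occurrences = new int[size];
        lastDocIds = new int[size];
    }

    /** Records one occurrence of term in docId and returns the term's termID. */
    int addOccurrence(String term, int docId) {
        Integer id = termIds.get(term);
        if (id == null) {
            if (numTerms == occurrences.length) {
                grow();
            }
            id = numTerms++;
            termIds.put(term, id);
        }
        occurrences[id]++;
        lastDocIds[id] = docId;
        return id;
    }

    // Grows all arrays together with some headroom. The formula is an
    // assumed stand-in for an oversizing helper such as ArrayUtil.oversize().
    private void grow() {
        int newSize = occurrences.length + (occurrences.length >> 3) + 3;
        occurrences = Arrays.copyOf(occurrences, newSize);
        lastDocIds = Arrays.copyOf(lastDocIds, newSize);
    }

    int occurrences(int termId) {
        return occurrences[termId];
    }

    int lastDocId(int termId) {
        return lastDocIds[termId];
    }

    int capacity() {
        return occurrences.length;
    }
}
```

Because every array is indexed by the same termID, growing or discarding the postings data means replacing a handful of int[] references rather than pooling or collecting one object per term.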