[jira] Updated: (LUCENE-2329) Use parallel arrays instead of PostingList objects

Michael Busch (JIRA) Wed, 31 Mar 2010 17:34:53 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Michael Busch updated LUCENE-2329:
----------------------------------

    Attachment: lucene-2329-2.patch

This patch:
 * Changes DocumentsWriter to trigger the flush using bytesAllocated instead of 
bytesUsed to mitigate the "running hot" issue Mike is seeing
 * Improves the ParallelPostingsArray to grow using ArrayUtil.oversize()
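
To illustrate the second point, here is a minimal sketch of parallel arrays that 
grow in lockstep with some over-allocation. It is not the code from the patch: 
the class name, the field names, and the oversize() helper below are hypothetical 
stand-ins, and the growth policy (about 1/8 headroom) is an assumption rather 
than the behavior of ArrayUtil.oversize() itself.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a parallel postings array; not Lucene's class. */
final class ParallelArraysSketch {
    // One slot per termID in each array (field names are illustrative).
    int[] textStarts;
    int[] intStarts;
    int[] byteStarts;

    ParallelArraysSketch(int size) {
        textStarts = new int[size];
        intStarts = new int[size];
        byteStarts = new int[size];
    }

    /**
     * Simplified over-allocation policy (an assumption): add about 1/8
     * headroom, and at least 3 extra slots, so that repeated small
     * grows do not each trigger a full array copy.
     */
    static int oversize(int minTargetSize) {
        int extra = Math.max(3, minTargetSize >> 3);
        long newSize = (long) minTargetSize + extra;
        return (int) Math.min(newSize, Integer.MAX_VALUE - 8);
    }

    int capacity() {
        return textStarts.length;
    }

    /** Grows all arrays together so that termID is a valid index in each. */
    void ensureCapacity(int termID) {
        if (termID < capacity()) {
            return;
        }
        int newSize = oversize(termID + 1);
        textStarts = Arrays.copyOf(textStarts, newSize);
        intStarts = Arrays.copyOf(intStarts, newSize);
        byteStarts = Arrays.copyOf(byteStarts, newSize);
    }
}
```

Growing by a fraction of the current size instead of by a fixed step keeps the 
amortized copy cost per added term constant.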

In IRC we discussed changing TermsHashPerField to shrink the parallel arrays in 
freeRAM(), but that involves tricky thread-safety changes, because one thread 
could call DocumentsWriter.balanceRAM(), which triggers freeRAM() across *all* 
thread states, while other threads keep indexing.

We decided to leave it the way it currently works: we discard the whole 
parallel array during flush and don't reuse it.  This is not optimal, but 
once LUCENE-2324 is done this won't be an issue anymore anyway.
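
As a rough sketch of the two decisions above (flush on allocated bytes, and 
discard the arrays on flush), consider the following. All names here are 
hypothetical and the accounting is deliberately simplified to a single int[]; 
the real DocumentsWriter tracks much more state than this.

```java
/** Hypothetical sketch of RAM accounting for one indexing thread state. */
final class FlushAccountingSketch {
    private final long ramBudgetBytes;
    private long bytesAllocated; // everything allocated, including unused slack
    private long bytesUsed;      // the part that actually holds indexed data
    private int[] postings;      // stands in for the parallel arrays

    FlushAccountingSketch(long ramBudgetBytes) {
        this.ramBudgetBytes = ramBudgetBytes;
    }

    void allocatePostings(int size) {
        postings = new int[size];
        bytesAllocated += 4L * size; // 4 bytes per int, ignoring the array header
    }

    void recordUse(int entries) {
        bytesUsed += 4L * entries;
    }

    /** Trigger based on used bytes: fires late if a lot of slack is allocated. */
    boolean shouldFlushOnUsed() {
        return bytesUsed >= ramBudgetBytes;
    }

    /** Trigger based on allocated bytes: bounds the real heap footprint. */
    boolean shouldFlushOnAllocated() {
        return bytesAllocated >= ramBudgetBytes;
    }

    /** Discards the arrays instead of shrinking or reusing them. */
    void flush() {
        postings = null;
        bytesAllocated = 0;
        bytesUsed = 0;
    }

    boolean hasPostings() {
        return postings != null;
    }
}
```

With a budget of 1024 bytes, allocating 256 ints while using only 10 of them 
already trips the allocated-bytes trigger, while the used-bytes trigger stays 
silent; that gap is where the heap can exceed the configured budget.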

Note that this new patch is against the flex branch: I thought we'd switch it 
over soon anyway?  I can also create a patch for trunk if that's preferred.

> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>
>                 Key: LUCENE-2329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2329
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: lucene-2329-2.patch, lucene-2329.patch, 
> lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having a very large number of long-lived PostingList 
> objects in TermsHashPerField we want to switch to parallel arrays.  The 
> termsHash will simply be an int[] which maps each term to a dense termID.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling and will remove the overhead of object 
> initialization and garbage collection.  Garbage collection in particular 
> should benefit significantly when the JVM runs low on memory, because in such 
> a situation the gc mark times can get very long if there is a large number of 
> long-lived objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  This improvement, though, can be made in a separate JIRA issue.
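
The layout described in the issue can be sketched roughly as follows: an int[] 
hash table whose slots hold dense termIDs, plus parallel arrays indexed by 
termID. This is only an illustration under simplifying assumptions; the class 
and field names are hypothetical, and terms are kept as String here, whereas 
the real TermsHashPerField stores term bytes in shared pools.

```java
import java.util.Arrays;

/** Hypothetical sketch: int[] hash of dense termIDs plus parallel arrays. */
final class TermsHashSketch {
    private int[] termsHash = new int[8];      // slot -> termID, -1 means empty
    private String[] termText = new String[4]; // parallel array, indexed by termID
    private int[] termFreq = new int[4];       // parallel array, indexed by termID
    private int numTerms;

    TermsHashSketch() {
        Arrays.fill(termsHash, -1);
    }

    /** Adds one occurrence of term and returns its dense termID. */
    int add(String term) {
        int mask = termsHash.length - 1;
        int slot = term.hashCode() & mask;
        while (termsHash[slot] != -1) {        // linear probing
            int id = termsHash[slot];
            if (termText[id].equals(term)) {
                termFreq[id]++;
                return id;
            }
            slot = (slot + 1) & mask;
        }
        int termID = numTerms++;
        if (termID == termText.length) {       // grow both arrays in lockstep
            termText = Arrays.copyOf(termText, termID * 2);
            termFreq = Arrays.copyOf(termFreq, termID * 2);
        }
        termText[termID] = term;
        termFreq[termID] = 1;
        termsHash[slot] = termID;
        if (numTerms * 2 > termsHash.length) {
            rehash();
        }
        return termID;
    }

    /** Doubles the hash table; termIDs and the parallel arrays are untouched. */
    private void rehash() {
        int[] bigger = new int[termsHash.length * 2];
        Arrays.fill(bigger, -1);
        int mask = bigger.length - 1;
        for (int id = 0; id < numTerms; id++) {
            int slot = termText[id].hashCode() & mask;
            while (bigger[slot] != -1) {
                slot = (slot + 1) & mask;
            }
            bigger[slot] = id;
        }
        termsHash = bigger;
    }

    int freq(int termID) { return termFreq[termID]; }
    String text(int termID) { return termText[termID]; }
    int size() { return numTerms; }
}
```

Per-term data lives in a few large primitive arrays instead of one object per 
term, so the garbage collector has only a handful of long-lived arrays to mark 
for the postings metadata itself.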

