[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects

Michael McCandless (JIRA) Tue, 23 Mar 2010 08:55:49 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848767#action_12848767 ]


Michael McCandless commented on LUCENE-2329:
--------------------------------------------

bq. But, keep in mind that TermVectors were enabled too.

OK, but RAM used by TermVectors* shouldn't participate in the accounting, 
i.e. it only holds RAM for one doc at a time.

bq. And the number of "unique terms" in the 2nd TermsHash is higher, i.e. if 
you summed up numPostings from the 2nd TermsHash in each round that sum should 
be higher than numPostings from the first TermsHash.

1st TermsHash = current trunk and 2nd TermsHash = this patch?  Ie, it has more 
unique terms at flush time (because it's more RAM efficient)?  If so, then yes, 
I agree :)  But 22% fewer still seems too good to be true...

> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>
>                 Key: LUCENE-2329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2329
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in 
> TermsHashPerField we want to switch to parallel arrays.  The termsHash will 
> simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in 
> parallel arrays, where the termID is the index into the arrays.  This will 
> avoid the need for object pooling and will remove the overhead of object 
> initialization and garbage collection.  Garbage collection especially should 
> benefit significantly when the JVM runs low on memory, because in such a 
> situation the gc mark times can get very long if there is a large number of 
> long-living objects in memory.
> Another benefit could be to build more efficient TermVectors.  We could avoid 
> the need of having to store the term string per document in the TermVector.  
> Instead we could just store the segment-wide termIDs.  This would reduce the 
> size and also make it easier to implement efficient algorithms that use 
> TermVectors, because no term mapping across documents in a segment would be 
> necessary.  We can make this improvement in a separate JIRA issue, though.
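
For readers following along, the parallel-array layout described in the issue can be sketched roughly as below. This is only an illustration under assumptions, not the code in the attached patch: the class name ParallelPostingsSketch, its fields, and its methods are hypothetical, and it keeps term text in a String[] for brevity where the real indexing chain would store term bytes differently. The point it shows is that the hash table holds only ints (dense termIDs), and every per-term value lives in a primitive array indexed by termID, so no per-term object is allocated.

```java
import java.util.Arrays;

// Hypothetical sketch of the idea in the issue description; not Lucene's
// actual TermsHashPerField and not the attached patch.
final class ParallelPostingsSketch {
    private int[] termsHash = new int[8];   // hash slot -> termID, -1 if empty (length is a power of two)
    private String[] terms = new String[4]; // termID -> term text (simplification for the sketch)
    private int[] termFreqs = new int[4];   // termID -> occurrences seen so far
    private int[] lastDocIDs = new int[4];  // termID -> last docID containing the term
    private int numTerms;

    ParallelPostingsSketch() {
        Arrays.fill(termsHash, -1);
    }

    // Records one occurrence of term in docID and returns its dense termID.
    int add(String term, int docID) {
        int mask = termsHash.length - 1;
        int slot = term.hashCode() & mask;
        while (termsHash[slot] != -1) {          // linear probing
            int termID = termsHash[slot];
            if (terms[termID].equals(term)) {    // existing term: update the arrays in place
                termFreqs[termID]++;
                lastDocIDs[termID] = docID;
                return termID;
            }
            slot = (slot + 1) & mask;
        }
        if (numTerms == terms.length) {
            growParallelArrays();
        }
        int termID = numTerms++;                 // new term: next dense ID
        terms[termID] = term;
        termFreqs[termID] = 1;
        lastDocIDs[termID] = docID;
        termsHash[slot] = termID;
        if (numTerms * 2 > termsHash.length) {   // keep the load factor at or below 0.5
            rehash();
        }
        return termID;
    }

    int freq(int termID) { return termFreqs[termID]; }
    int lastDocID(int termID) { return lastDocIDs[termID]; }
    int size() { return numTerms; }

    // All per-term arrays grow together so termID stays a valid index into each.
    private void growParallelArrays() {
        int newSize = terms.length * 2;
        terms = Arrays.copyOf(terms, newSize);
        termFreqs = Arrays.copyOf(termFreqs, newSize);
        lastDocIDs = Arrays.copyOf(lastDocIDs, newSize);
    }

    // Doubles the hash table and reinserts every termID; the parallel arrays are untouched.
    private void rehash() {
        int[] newHash = new int[termsHash.length * 2];
        Arrays.fill(newHash, -1);
        int mask = newHash.length - 1;
        for (int termID = 0; termID < numTerms; termID++) {
            int slot = terms[termID].hashCode() & mask;
            while (newHash[slot] != -1) {
                slot = (slot + 1) & mask;
            }
            newHash[slot] = termID;
        }
        termsHash = newHash;
    }
}
```

In this sketch the number of long-living objects stays constant as terms are added (a handful of arrays), which is the property the description expects to help gc mark times. Because termIDs are dense and segment-wide, they could also serve as the per-document term references suggested for TermVectors.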

