[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

Toke Eskildsen (JIRA) Wed, 02 Feb 2011 07:49:57 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989679#comment-12989679 ]


Toke Eskildsen commented on LUCENE-2843:
----------------------------------------

Thank you. I will use the FixedGap-version myself, but that only works when I'm 
the one controlling the index build, right?

As for the faceting system, the principle is really simple: instead of holding 
terms (BytesRefs) in memory, I just hold their ordinals. As the terms 
themselves only need to be resolved when the final faceting result is to be 
returned, seeking for a few hundred or thousand terms by their ordinal has 
worked very well so far (no guarantees for old hardware such as spinning disks 
though).
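
To make the ordinal idea concrete, here is a minimal sketch of it in plain Java. It is not the actual faceting code: the class name, the in-memory TERMS array (standing in for the on-disk terms dictionary) and lookupByOrdinal are all hypothetical, and a real implementation would seek into the index by ordinal instead of reading an array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Lucene.
final class OrdinalFacetSketch {
    // Stand-in for the terms dictionary: sorted terms, ordinal = position.
    static final String[] TERMS = {"apple", "banana", "cherry", "date"};

    // Assumed to be the expensive step (a seek by ordinal in a real index).
    static String lookupByOrdinal(int ord) {
        return TERMS[ord];
    }

    // Counting touches only ints; no term bytes are held or compared.
    static int[] countOrdinals(int[][] docOrdinals, int termCount) {
        int[] counts = new int[termCount];
        for (int[] ords : docOrdinals) {
            for (int ord : ords) {
                counts[ord]++;
            }
        }
        return counts;
    }

    // Only the n winning ordinals are resolved to terms at the very end.
    static List<String> topFacets(int[] counts, int n) {
        Integer[] ords = new Integer[counts.length];
        for (int i = 0; i < ords.length; i++) {
            ords[i] = i;
        }
        // Highest count first; ties broken by lower ordinal.
        Arrays.sort(ords, (a, b) -> counts[b] != counts[a]
                ? Integer.compare(counts[b], counts[a])
                : Integer.compare(a, b));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, ords.length); i++) {
            if (counts[ords[i]] == 0) {
                break; // skip terms that never occurred
            }
            result.add(lookupByOrdinal(ords[i]) + "=" + counts[ords[i]]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Each inner array lists the term ordinals of one document.
        int[][] docs = {{0, 2}, {2}, {1, 2}, {0}};
        System.out.println(topFacets(countOrdinals(docs, TERMS.length), 2));
    }
}
```

The point of the sketch is that the per-term memory cost during counting is one int, regardless of term length, and that the number of ordinal lookups is bounded by the size of the returned result rather than by the number of terms.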

The memory savings over holding BytesRefs in memory of course vary with term 
lengths. There are some numbers at 
https://sbdevel.wordpress.com/2010/10/11/hierarchical-faceting/ if someone 
finds them interesting, and LUCENE-2369 has some measurements of the same 
principle applied to sorting.

> Add variable-gap terms index impl.
> ----------------------------------
>
>                 Key: LUCENE-2843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2843.patch, LUCENE-2843.patch
>
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM.  This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though, it does not support ord() so it's not quite a fair
> comparison.
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum.  Eg we need seekFloor
> when the FST is used as a terms index but seekCeil when it's holding
> all terms in the index (ie which SimpleText uses FSTs for).
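
The difference between the two seek directions in the description above can be illustrated with a rough analogy in plain Java. This is not Lucene code: the class and method names are hypothetical, a TreeMap stands in for the FST, and a sorted String array stands in for the full terms dictionary. The sketch indexes every Nth term (the fixed-gap selection rule) and returns array positions purely for convenience, which the FST-based index in the patch does not offer since it has no ord().

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; a TreeMap plays the role of the terms index.
final class SparseTermsIndexSketch {
    private final TreeMap<String, Integer> index = new TreeMap<>();
    private final String[] sortedTerms;

    // Index every gap-th term; the array must already be sorted.
    SparseTermsIndexSketch(String[] sortedTerms, int gap) {
        this.sortedTerms = sortedTerms;
        for (int pos = 0; pos < sortedTerms.length; pos += gap) {
            index.put(sortedTerms[pos], pos);
        }
    }

    int indexedTermCount() {
        return index.size();
    }

    // Floor seek on the sparse index: greatest indexed term <= target.
    String seekFloorIndexed(String target) {
        return index.floorKey(target);
    }

    // Exact lookup: floor-seek the index, then scan the dictionary forward.
    // Returns the position of the term, or -1 if it does not exist.
    int seekExact(String target) {
        Map.Entry<String, Integer> floor = index.floorEntry(target);
        if (floor == null) {
            return -1; // target sorts before the first term
        }
        for (int pos = floor.getValue(); pos < sortedTerms.length; pos++) {
            int cmp = sortedTerms[pos].compareTo(target);
            if (cmp == 0) {
                return pos;
            }
            if (cmp > 0) {
                break; // passed the slot where the target would be
            }
        }
        return -1;
    }

    // Ceil seek over all terms: smallest term >= target, or null if none.
    String seekCeil(String target) {
        int pos = Arrays.binarySearch(sortedTerms, target);
        if (pos < 0) {
            pos = -pos - 1; // convert to insertion point
        }
        return pos < sortedTerms.length ? sortedTerms[pos] : null;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "b", "c", "d", "e", "f", "g"};
        SparseTermsIndexSketch idx = new SparseTermsIndexSketch(terms, 3);
        System.out.println(idx.seekFloorIndexed("e")); // indexed entry to scan from
        System.out.println(idx.seekExact("e"));
        System.out.println(idx.seekCeil("cc"));
    }
}
```

A sparse index needs the floor because the scan must start at or before the target; a structure holding every term can use the ceiling directly, since that entry is the answer.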

