[jira] Updated: (LUCENE-2843) Add variable-gap terms index impl.

Michael McCandless (JIRA) Tue, 04 Jan 2011 03:36:12 -0800

     [ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-2843:
---------------------------------------

    Attachment: LUCENE-2843.patch

New patch, resolving all nocommits.  I think it's ready to commit.

I cutover StandardCodec to the VariableGapTermsIndex, with 'every 32' as the 
index term selection policy.  We could lower 32 to eg 20, since FST uses so 
much less RAM than packed ints/bytes, but for now I'm just leaving it at 32 to 
be safe.

The "let FST builder pick the indexed terms" turns out not to be very easy to 
do w/ this API.  I put a big comment explaining this in the var gap writer.  
It's also not clear we'd want to use this approach, since the resulting term 
index density can then vary substantially.

> Add variable-gap terms index impl.
> ----------------------------------
>
>                 Key: LUCENE-2843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2843.patch, LUCENE-2843.patch
>
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM.  This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though, it does not support ord() so it's not quite a fair
> comparison.
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum.  Eg we need seekFloor
> when the FST is used as a terms index but seekCeil when it's holding
> all terms in the index (ie which SimpleText uses FSTs for).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Updated: (LUCENE-2843) Add variable-gap terms index impl.

Reply via email to