[jira] Commented: (LUCENE-2843) Add variable-gap terms index impl.

Robert Muir (JIRA) Wed, 02 Feb 2011 08:19:57 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989695#comment-12989695
 ]


Robert Muir commented on LUCENE-2843:
-------------------------------------

bq. Robert, there is already OrdTermState to hold the ord, but the ordinal 
itself is only interesting if the corresponding term can be seeked from it. 

You can seek to any arbitrary TermState (even if it's not holding an ord), but 
it might hold other things you don't care about.

bq. As for the FSTEnum-idea then I don't understand how it can work with 
faceting where the terms to return are defined by the documents from a search? 
...But maybe we should discuss that elsewhere.

In the general case, if you are using something like a priority queue to get 
the top-N terms (even if you are filtering by the documents from a search), 
wouldn't this number mean that, once your priority queue is full, you can tell 
that an entire block of low-frequency terms is not competitive enough to enter 
the PQ, without going to disk?
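
A minimal sketch of that pruning idea, assuming "this number" is a per-block 
upper bound on term frequency kept in RAM next to the terms index. The names 
TopTermsSketch, TermBlock, maxFreq and topN are hypothetical and are not Lucene 
API; the arrays inside each block stand in for data that would live on disk. 
When filtering by search results, a term's filtered count can never exceed its 
docFreq, so the same bound would still be safe to prune on.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these types exist in Lucene.
final class TopTermsSketch {

  // A block of terms plus an upper bound on the frequency of any term in it.
  static final class TermBlock {
    final String[] terms; // stands in for the on-disk block contents
    final int[] freqs;
    final int maxFreq;    // assumed to be available without reading the block

    TermBlock(String[] terms, int[] freqs) {
      this.terms = terms;
      this.freqs = freqs;
      int max = 0;
      for (int f : freqs) {
        max = Math.max(max, f);
      }
      this.maxFreq = max;
    }
  }

  static final class Entry {
    final String term;
    final int freq;

    Entry(String term, int freq) {
      this.term = term;
      this.freq = freq;
    }
  }

  static final class Result {
    final List<String> terms; // highest frequency first
    final int blocksRead;     // how many blocks we had to "go to disk" for

    Result(List<String> terms, int blocksRead) {
      this.terms = terms;
      this.blocksRead = blocksRead;
    }
  }

  static Result topN(List<TermBlock> blocks, int n) {
    if (n <= 0) {
      return new Result(new ArrayList<String>(), 0);
    }
    // Min-heap on frequency: the head is the weakest entry currently kept.
    PriorityQueue<Entry> pq =
        new PriorityQueue<>((a, b) -> Integer.compare(a.freq, b.freq));
    int blocksRead = 0;
    for (TermBlock block : blocks) {
      // Once the queue is full, a block whose best term cannot beat the
      // weakest kept entry is skipped without looking at its contents.
      if (pq.size() == n && block.maxFreq <= pq.peek().freq) {
        continue;
      }
      blocksRead++;
      for (int i = 0; i < block.terms.length; i++) {
        if (pq.size() < n) {
          pq.add(new Entry(block.terms[i], block.freqs[i]));
        } else if (block.freqs[i] > pq.peek().freq) {
          pq.poll();
          pq.add(new Entry(block.terms[i], block.freqs[i]));
        }
      }
    }
    List<String> top = new ArrayList<>();
    while (!pq.isEmpty()) {
      top.add(pq.poll().term);
    }
    Collections.reverse(top);
    return new Result(top, blocksRead);
  }
}
```

In this sketch the saving shows up as blocksRead being smaller than the number 
of blocks; how much that helps in practice would depend on how terms of 
similar frequency cluster within blocks.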


> Add variable-gap terms index impl.
> ----------------------------------
>
>                 Key: LUCENE-2843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2843.patch, LUCENE-2843.patch
>
>
> PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
> supports pluggable terms index impls.
> The only impl we have now is FixedGapTermsIndexReader/Writer, which
> picks every Nth (default 32) term and holds it in efficient packed
> int/byte arrays in RAM.  This is already an enormous improvement (RAM
> reduction, init time) over 3.x.
> This patch adds another impl, VariableGapTermsIndexReader/Writer,
> which lets you specify an arbitrary IndexTermSelector to pick which
> terms are indexed, and then uses an FST to hold the indexed terms.
> This is typically even more memory efficient than packed int/byte
> arrays, though it does not support ord(), so it's not quite a fair
> comparison.
> I had to relax the terms index plugin api for
> PrefixCodedTermsReader/Writer to not assume that the terms index impl
> supports ord.
> I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
> out separate seekCeil and seekFloor in FSTEnum.  E.g., we need seekFloor
> when the FST is used as a terms index but seekCeil when it's holding
> all terms in the index (i.e., what SimpleText uses FSTs for).
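
The seekFloor/seekCeil distinction in the quoted description can be 
illustrated with a sorted map. This is only an analogy: java.util.TreeMap 
stands in for the FST, and SeekSketch, blockStartFor and firstTermAtOrAfter 
are hypothetical names, not the FSTEnum API. The block offsets are made-up 
values.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; TreeMap is a stand-in for the FST.
final class SeekSketch {

  // Terms-index use: only some terms are indexed, each mapped to the start
  // of its block. The block that may contain the target starts at the
  // greatest indexed term <= target, i.e. a floor lookup.
  static long blockStartFor(TreeMap<String, Long> index, String target) {
    Map.Entry<String, Long> e = index.floorEntry(target);
    return e == null ? -1L : e.getValue(); // -1: target sorts before every indexed term
  }

  // All-terms use: every term is present, so seeking means finding the
  // smallest term >= target, i.e. a ceiling lookup. Returns null if none.
  static String firstTermAtOrAfter(TreeMap<String, Long> allTerms, String target) {
    return allTerms.ceilingKey(target);
  }
}
```

With a sparse index holding "apple", "mango" and "tomato", a lookup for 
"orange" floors to the "mango" block, which is then scanned forward; with a 
complete dictionary, a ceiling lookup lands directly on the first term at or 
after the target.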
