[jira] [Commented] (LUCENE-3069) Lucene should have an entirely memory resident term dictionary

Michael McCandless (JIRA) Mon, 15 Jul 2013 06:53:41 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13708465#comment-13708465 ]


Michael McCandless commented on LUCENE-3069:
--------------------------------------------

The new code on the branch looks great!  I can't wait to see perf results
after we implement .intersect().

Some small stuff in TempFSTTermsReader.java:

  * In next(), when we handle seekPending=true, I think we should assert
    that the seekCeil returned SeekStatus.FOUND?  Ie, it's not
    possible to seekExact(TermState) to a term that doesn't exist.

  * useCache is an ancient option from back when we had a terms dict
    cache; we long ago removed it ... I think we should remove
    useCache parameter too?

  * It's silly that fstEnum.seekCeil doesn't return a status, ie that
    we must re-compare the term we got to differentiate FOUND vs
    NOT_FOUND ... so we lose some perf here.  But this is just a
    future TODO ...

  * "nocommit: this method doesn't act as 'seekExact' right?" -- not
    sure why this is here; seekExact is working as it should I think.

  * Maybe instead of term and meta members, we could just hold the
    current pair?
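
A minimal sketch of two of the points above: the assert after a pending
seek in next(), and the re-compare needed because the underlying ceiling
lookup reports no status. This is not the branch code. ToyTermsEnum, the
local SeekStatus enum, and the TreeMap standing in for the FST are all
hypothetical, and it assumes next() first completes the deferred seek and
then advances:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in mirroring the three TermsEnum.SeekStatus values.
enum SeekStatus { END, FOUND, NOT_FOUND }

// Toy terms enum over a sorted map (term -> docFreq); the TreeMap stands in for the FST.
final class ToyTermsEnum {
    private final TreeMap<String, Integer> terms;
    private String current;      // term we are positioned on, or null
    private String pendingTerm;  // non-null while a seekExact(state) is deferred
    private boolean started;     // false until the first seek or next()

    ToyTermsEnum(TreeMap<String, Integer> terms) {
        this.terms = terms;
    }

    // Position on the smallest term >= target.
    SeekStatus seekCeil(String target) {
        pendingTerm = null;
        started = true;
        Map.Entry<String, Integer> e = terms.ceilingEntry(target);
        if (e == null) {
            current = null;
            return SeekStatus.END;
        }
        current = e.getKey();
        // The lookup itself reports no status, so re-compare to tell FOUND from NOT_FOUND.
        return current.equals(target) ? SeekStatus.FOUND : SeekStatus.NOT_FOUND;
    }

    // Deferred seek: only record the term; the real lookup happens in next().
    void seekExact(String termFromState) {
        pendingTerm = termFromState;
    }

    // Advance to the following term, or return null when exhausted.
    String next() {
        if (pendingTerm != null) {
            SeekStatus status = seekCeil(pendingTerm); // also clears pendingTerm
            // A term taken from a TermState must exist, so anything else is a bug.
            assert status == SeekStatus.FOUND : "seekExact(state) on a missing term";
        }
        if (current == null) {
            current = (started || terms.isEmpty()) ? null : terms.firstKey();
        } else {
            current = terms.higherKey(current);
        }
        started = true;
        return current;
    }

    String term() {
        return current;
    }
}
```

If the ceiling lookup returned a status directly, the equals() call in
seekCeil would go away; that is the perf loss mentioned above.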

In TempTermOutputs.java:

  * longsSize, hasPos can be final?  (Same with TempMetaData's fields)

  * TempMetaData.hashCode() doesn't mix in docFreq/totalTermFreq?

  * It doesn't impl equals (must it really impl hashCode?)
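
For illustration, a sketch of what final fields plus a consistent
hashCode()/equals() pair could look like. ToyMetaData is a hypothetical
stand-in; the field set (longs, bytes, docFreq, totalTermFreq) is an
assumption about TempMetaData, not copied from the branch:

```java
import java.util.Arrays;

// Hypothetical stand-in for TempMetaData; the field layout is assumed.
final class ToyMetaData {
    final long[] longs;
    final byte[] bytes;
    final int docFreq;
    final long totalTermFreq;

    ToyMetaData(long[] longs, byte[] bytes, int docFreq, long totalTermFreq) {
        this.longs = longs;
        this.bytes = bytes;
        this.docFreq = docFreq;
        this.totalTermFreq = totalTermFreq;
    }

    // Mix every field that equals() compares, including docFreq and totalTermFreq.
    @Override
    public int hashCode() {
        int h = Arrays.hashCode(longs);
        h = 31 * h + Arrays.hashCode(bytes);
        h = 31 * h + Integer.hashCode(docFreq);
        h = 31 * h + Long.hashCode(totalTermFreq);
        return h;
    }

    // Field-by-field equality, kept consistent with hashCode().
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof ToyMetaData)) {
            return false;
        }
        ToyMetaData that = (ToyMetaData) other;
        return docFreq == that.docFreq
            && totalTermFreq == that.totalTermFreq
            && Arrays.equals(longs, that.longs)
            && Arrays.equals(bytes, that.bytes);
    }
}
```

If nothing ever hashes these objects, dropping hashCode() entirely is the
other consistent choice.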

                
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 4.4
>
>         Attachments: df-ttf-estimate.txt, example.png, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a 
> delta codec file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds a FST from the entire term 
> not just the delta.


