[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707634#comment-13707634 ]

Robert Muir commented on LUCENE-3069:
-------------------------------------

{quote}
Also, for long-tail terms, the totalTermFreq
can safely be inlined into docFreq (for the body field in wikimedium1m, 85.8% of 
terms have df == ttf).
{quote}

Cool idea! I wonder how many of those are df == ttf == 1?

We would currently waste a byte in this case (because we write a vInt for a 
docFreq of 1, and then a vInt of 0 for the delta totalTermFreq - docFreq).

Maybe we could try writing a vInt of 0 for docFreq to indicate that both 
docFreq and totalTermFreq are 1?
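
For concreteness, a minimal sketch of that encoding. TermStatsCodec and its 
helpers are hypothetical names for illustration; the real postings writer 
interleaves these stats with other per-term metadata:

{code:java}
import java.io.IOException;

import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.DataOutput;

// Hypothetical helper illustrating the proposed encoding: docFreq is always
// >= 1 for a term that exists, so a vInt of 0 is free to act as a sentinel
// for the common long-tail case df == ttf == 1, saving one byte per such term
// (one sentinel byte instead of one byte for df=1 plus one byte for delta=0).
class TermStatsCodec {

  static void write(DataOutput out, int docFreq, long totalTermFreq) throws IOException {
    if (docFreq == 1 && totalTermFreq == 1) {
      out.writeVInt(0);                        // sentinel: df == ttf == 1
    } else {
      out.writeVInt(docFreq);                  // always >= 1 here
      out.writeVLong(totalTermFreq - docFreq); // delta, always >= 0
    }
  }

  // Returns {docFreq, totalTermFreq}.
  static long[] read(DataInput in) throws IOException {
    final int code = in.readVInt();
    if (code == 0) {
      return new long[] {1, 1};                // sentinel decoded
    }
    return new long[] {code, code + in.readVLong()};
  }
}
{code}

The sentinel costs nothing in the non-degenerate cases because a stored docFreq 
of 0 can never occur for a real term.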

> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 4.4
>
>         Attachments: example.png, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement, yet it still uses a 
> delta-coded file for scanning to terms. Some environments have enough memory 
> available to keep the entire FST-based term dict in memory. We should add a 
> TermDictionary implementation that encodes all needed information for each 
> term into the FST (custom fst.Output) and builds an FST from the entire term, 
> not just the delta.
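
For illustration, a minimal sketch of the direction the issue describes, built 
against the Lucene 4.x fst API. PositiveIntOutputs stands in for the custom 
per-term fst.Output the issue proposes, and MemoryTermDict/termsAndPointers are 
hypothetical names:

{code:java}
import java.io.IOException;
import java.util.SortedMap;

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

// Toy memory-resident term dictionary: the FST maps each whole term to a
// single positive long. The issue proposes a custom fst.Output carrying all
// per-term metadata (docFreq, totalTermFreq, postings file pointers); one
// long file pointer stands in for that here.
class MemoryTermDict {

  static FST<Long> build(SortedMap<BytesRef, Long> termsAndPointers) throws IOException {
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRef scratch = new IntsRef();
    // The FST builder requires terms in sorted order; a SortedMap over
    // BytesRef (unsigned byte order) guarantees that.
    for (SortedMap.Entry<BytesRef, Long> e : termsAndPointers.entrySet()) {
      builder.add(Util.toIntsRef(e.getKey(), scratch), e.getValue());
    }
    return builder.finish();
  }

  static Long lookup(FST<Long> fst, BytesRef term) throws IOException {
    return Util.get(fst, term); // null if the term is absent
  }
}
{code}

Since the FST stores each whole term rather than a shared prefix plus an 
on-disk delta block, a lookup never has to scan the terms file; the trade-off 
is keeping the full structure on the Java heap.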
