[
https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707649#comment-13707649
]
Han Jiang commented on LUCENE-3069:
-----------------------------------
bq. Cool idea! I wonder how many of those are df == ttf == 1?
I didn't try a very precise estimation, but the percentage is large:
for the wikimedium1m index, the largest segment has a 'body' field with:
{noformat}
bitwidth/7    df==ttf / df
1             1324400 / 1542987
2                 110 /   18951
3                   0 /     175
4                   0 /       0
5                   0 /       0
{noformat}
That is where the 85.8% comes from. 'bitwidth/7' means 'ceil(bitwidth of df / 7)',
since we're using VInt encoding.
So, for this field, we can save (1324400+110*2) bytes by stealing one bit.
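The byte counts above follow from how VInt works: each VInt byte carries 7 payload bits, so a value of bitwidth b occupies ceil(b / 7) bytes. A minimal sketch (hypothetical helper, not Lucene's actual `DataOutput` code) that reproduces the arithmetic:

```java
// Sketch: VInt byte length is ceil(bitwidth / 7), since each encoded
// byte holds 7 payload bits plus a continuation bit.
public class VIntSize {
    // Number of bytes a VInt encoding of v occupies (v >= 0).
    static int vIntLength(int v) {
        int bytes = 1;
        while ((v & ~0x7F) != 0) { // more than 7 significant bits remain
            v >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(vIntLength(1));     // 1 byte  (bitwidth 1)
        System.out.println(vIntLength(128));   // 2 bytes (bitwidth 8)
        System.out.println(vIntLength(16384)); // 3 bytes (bitwidth 15)
        // Savings from skipping ttf for the df==ttf terms in the table:
        // 1324400 one-byte cases plus 110 two-byte cases.
        System.out.println(1324400 + 110 * 2); // 1324620
    }
}
```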
bq. Maybe we could try writing a vInt of 0 for docFreq to indicate that both
docFreq and totalTermFreq are 1?
Yes, that may help! I'll try to test the percentage. But we should still note
that df is only a small part of the term dict data.
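The suggestion works because a real docFreq is always >= 1, so the value 0 is free to act as a sentinel meaning df == ttf == 1, collapsing two VInts into one byte. A hedged sketch (hypothetical class and method names, not Lucene's actual postings codec):

```java
import java.io.ByteArrayOutputStream;

// Sketch of the "vInt 0 as sentinel" idea: since docFreq >= 1 always
// holds, 0 can signal docFreq == totalTermFreq == 1 and nothing else
// needs to be written for that term.
public class TermStatsCodec {
    // Minimal VInt writer: 7 payload bits per byte, high bit = continue.
    static void writeVInt(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    static byte[] encodeStats(int df, long ttf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (df == 1 && ttf == 1) {
            writeVInt(out, 0);        // sentinel: one byte instead of two
        } else {
            writeVInt(out, df);
            writeVInt(out, ttf - df); // delta, since ttf >= df
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeStats(1, 1).length); // 1 byte
        System.out.println(encodeStats(5, 9).length); // 2 bytes
    }
}
```

Per the table above, this would save one byte for each of the ~1.3M df==ttf==1 terms in that segment.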
> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
> Key: LUCENE-3069
> URL: https://issues.apache.org/jira/browse/LUCENE-3069
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index, core/search
> Affects Versions: 4.0-ALPHA
> Reporter: Simon Willnauer
> Assignee: Han Jiang
> Labels: gsoc2013
> Fix For: 4.4
>
> Attachments: example.png, LUCENE-3069.patch
>
>
> The FST-based TermDictionary has been a great improvement, yet it still uses a
> delta-coded file for scanning to terms. Some environments have enough memory
> available to keep the entire FST-based term dict in memory. We should add a
> TermDictionary implementation that encodes all needed information for each
> term into the FST (custom fst.Output) and builds an FST from the entire term,
> not just the delta.