[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707649#comment-13707649 ]
Han Jiang commented on LUCENE-3069:
-----------------------------------

bq. Cool idea! I wonder how many of those are df == ttf == 1?

I didn't try a very precise estimation, but the percentage will be large:
For the index of wikimedium1m, the largest segment has a 'body' field with:

{noformat}
bitwidth/7   df==ttf / df
1            1324400 / 1542987
2                110 / 18951
3                  0 / 175
4                  0 / 0
5                  0 / 0
{noformat}

That is where the 85.8% comes from (1324400 / 1542987). 'bitwidth/7' means 'ceil(bitwidth of df / 7)', since we're using VInt encoding.
So, for this field, we can save (1324400 + 110*2) bytes by stealing one bit.

bq. Maybe we could try writing a vInt of 0 for docFreq to indicate that both docFreq and totalTermFreq are 1?

Yes, that may help! I'll try to test the percentage.
But we should still note that df is only a small part of the term dict data.

> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>    Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 4.4
>
>         Attachments: example.png, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a
> delta codec file for scanning to terms. Some environments have enough memory
> available to keep the entire FST based term dict in memory. We should add a
> TermDictionary implementation that encodes all needed information for each
> term into the FST (custom fst.Output) and builds a FST from the entire term
> not just the delta.

--
This message is automatically generated by JIRA.
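To make the "vInt of 0 for docFreq" suggestion in the comment above concrete, here is a minimal Java sketch. It is not taken from the attached patch. The class name `TermStatsSketch`, its method names, and the baseline layout (docFreq followed by totalTermFreq - docFreq, each written 7 bits per byte with a continuation bit) are assumptions made for illustration. The sentinel is unambiguous because a term that exists in the dictionary always has docFreq >= 1, so 0 is otherwise unused.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not Lucene code: compares an assumed baseline layout
// for term stats with a variant that writes a single 0 when df == ttf == 1.
public class TermStatsSketch {

    /** Number of bytes a non-negative value needs at 7 payload bits per byte. */
    public static int vIntLength(long v) {
        int n = 1;
        while ((v & ~0x7FL) != 0) {
            v >>>= 7;
            n++;
        }
        return n;
    }

    /** Writes low-order 7-bit groups first; the high bit marks "more bytes follow". */
    static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    static long readVLong(byte[] buf, int[] pos) {
        long v = 0;
        int shift = 0;
        byte b;
        do {
            b = buf[pos[0]++];
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }

    /** Assumed baseline: docFreq, then totalTermFreq - docFreq. */
    public static byte[] encodeBaseline(int docFreq, long totalTermFreq) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVLong(out, docFreq);
        writeVLong(out, totalTermFreq - docFreq);
        return out.toByteArray();
    }

    /** Sentinel variant: one 0 byte stands for docFreq == totalTermFreq == 1. */
    public static byte[] encodeWithSentinel(int docFreq, long totalTermFreq) {
        if (docFreq == 1 && totalTermFreq == 1) {
            return new byte[] {0};
        }
        return encodeBaseline(docFreq, totalTermFreq);
    }

    /** Decodes the sentinel variant into {docFreq, totalTermFreq}. */
    public static long[] decode(byte[] buf) {
        int[] pos = {0};
        long df = readVLong(buf, pos);
        if (df == 0) {
            return new long[] {1, 1}; // sentinel: both stats are 1
        }
        long delta = readVLong(buf, pos);
        return new long[] {df, df + delta};
    }

    public static void main(String[] args) {
        System.out.println("baseline bytes for df=ttf=1: " + encodeBaseline(1, 1).length);
        System.out.println("sentinel bytes for df=ttf=1: " + encodeWithSentinel(1, 1).length);
    }
}
```

Under this assumed layout the sentinel saves one byte for every term with df == ttf == 1 and leaves all other terms unchanged. The table in the comment counts terms with df == ttf rather than df == ttf == 1, so it does not give the exact saving for this variant.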