[ https://issues.apache.org/jira/browse/LUCENE-3069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707709#comment-13707709 ]
Han Jiang edited comment on LUCENE-3069 at 7/13/13 10:05 AM:
-------------------------------------------------------------

I ran CheckIndex on wikimediumall.trunk.Lucene41.nd33.3326M.

Here is the bit-width summary for the "body" field:

||bit||#(df==ttf)||#df||#ttf||
| 1| 43532656 | 48860170| 43532656|
| 2| 10328824 | 13979539| 16200377|
| 3| 2682453 | 5032450| 6532755|
| 4| 836109 | 2471794| 3134437|
| 5| 262696 | 1324704| 1718862|
| 6| 86487 | 755797| 990563|
| 7| 29276 | 442974| 571996|
| 8| 11257 | 263874| 339382|
| 9| 4627 | 161402| 205662|
|10| 2060 | 102198| 128034|
|11| 979 | 63955| 79531|
|12| 386 | 39377| 48805|
|13| 170 | 24321| 30113|
|14| 65 | 14686| 18437|
|15| 10 | 9055| 10918|
|16| 2 | 5229| 6821|
|17| 0 | 2669| 3595|
|18| 0 | 1312| 1897|
|19| 0 | 696| 914|
|20| 0 | 209| 509|
|21| 0 | 44| 148|
|22| 0 | 4| 38|
|23| 0 | 0| 8|
|24| 0 | 0| 1|
|...|0|0|0|
|tot|57778057|73556459|73556459|

So 66.4% of terms have df==1, and 78.5% have df==ttf. Taking the different bit widths into account, the df+ttf encoding saves 57.3MB out of 148.7MB in total, using the following estimate:

{noformat}
old_size = col[2] * vIntByteSize(rownumber) + col[3] * vIntByteSize(rownumber)
new_size = col[2] * vIntByteSize(rownumber+1) + (col[3] - col[1]) * vIntByteSize(rownumber)
{noformat}

By the way, I am quite tempted to omit frq blocks in Lucene41PostingsReader. When we know that df==ttf, every in-doc frq must be 1. Today, for example, when the bit width ranges from 2 to 8 (inclusive), df is not large enough to create ForBlocks, so we have to VInt encode each in-doc freq. For this 'body' field, I think we can reduce the index size by about 67.5MB (here I only count vInt blocks, since a 1-bit ForBlock is usually small).

Across all the fields in wikimediumall, we can save 60.8MB out of 245.2MB (for df+ttf only), while the vInt frq blocks we could omit from PBF come to about 95.8MB, I suppose. I'll test this later.
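The estimate above can be checked directly from the table. The following is a minimal sketch in Python, not code from the patch: it assumes that vIntByteSize(n) is the number of bytes a VInt needs for an n-bit value (7 payload bits per byte), that col[1], col[2] and col[3] are the #(df==ttf), #df and #ttf columns, and that 1MB means 10^6 bytes. The rownumber+1 term is read here as one extra bit on df to flag df==ttf; that reading is an assumption. The names `ROWS` and `vint_byte_size` are made up for illustration.

```python
# Rows of the "body" field table: (bit width, #(df==ttf), #df, #ttf).
# All-zero rows above bit width 24 are omitted; they contribute nothing.
ROWS = [
    (1, 43532656, 48860170, 43532656),
    (2, 10328824, 13979539, 16200377),
    (3, 2682453, 5032450, 6532755),
    (4, 836109, 2471794, 3134437),
    (5, 262696, 1324704, 1718862),
    (6, 86487, 755797, 990563),
    (7, 29276, 442974, 571996),
    (8, 11257, 263874, 339382),
    (9, 4627, 161402, 205662),
    (10, 2060, 102198, 128034),
    (11, 979, 63955, 79531),
    (12, 386, 39377, 48805),
    (13, 170, 24321, 30113),
    (14, 65, 14686, 18437),
    (15, 10, 9055, 10918),
    (16, 2, 5229, 6821),
    (17, 0, 2669, 3595),
    (18, 0, 1312, 1897),
    (19, 0, 696, 914),
    (20, 0, 209, 509),
    (21, 0, 44, 148),
    (22, 0, 4, 38),
    (23, 0, 0, 8),
    (24, 0, 0, 1),
]


def vint_byte_size(bits: int) -> int:
    """Bytes needed to VInt-encode a value of the given bit width (assumed: 7 bits per byte)."""
    return max(1, (bits + 6) // 7)


# Current encoding: df and ttf are each written as a VInt of their own width.
old_size = sum((df + ttf) * vint_byte_size(bits) for bits, _eq, df, ttf in ROWS)

# Proposed encoding: df grows by one bit, and ttf is skipped whenever df == ttf.
new_size = sum(
    df * vint_byte_size(bits + 1) + (ttf - eq) * vint_byte_size(bits)
    for bits, eq, df, ttf in ROWS
)

saved = old_size - new_size
print(f"old   = {old_size / 1e6:.1f} MB")
print(f"new   = {new_size / 1e6:.1f} MB")
print(f"saved = {saved / 1e6:.1f} MB")
```

Under these assumptions the script reproduces the figures quoted in the comment: about 148.7MB before and 57.3MB saved.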
was (Author: billy): the same text as above, except that the table also listed the all-zero rows for bit widths 25 to 32 individually.

> Lucene should have an entirely memory resident term dictionary
> --------------------------------------------------------------
>
>                 Key: LUCENE-3069
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3069
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/index, core/search
>   Affects Versions: 4.0-ALPHA
>            Reporter: Simon Willnauer
>            Assignee: Han Jiang
>              Labels: gsoc2013
>             Fix For: 4.4
>
>         Attachments: example.png, LUCENE-3069.patch
>
>
> FST based TermDictionary has been a great improvement yet it still uses a
> delta codec file for scanning to terms. Some environments have enough memory
> available to keep the entire FST based term dict in memory. We should add a
> TermDictionary implementation that encodes all needed information for each
> term into the FST (custom fst.Output) and builds a FST from the entire term
> not just the delta.